About This Manual


Preface 


Read This First 


This manual is a reference for programming TMS320C62xx digital signal pro- 
cessor (DSP) devices. 


Before you use this book, you should install your code generation and debug- 
ging tools. 


This book is organized in four major parts: 


- Part I: Introduction includes a brief description of the ’C62xx architecture
  and code development flow. It also includes a tutorial that introduces you
  to the tools you will use in each phase of development.

- Part II: C Code includes C code examples and discusses optimization
  methods for the code. This information can help you choose the most
  appropriate optimization techniques for your code.

- Part III: Assembly Code describes the structure of assembly code. It also
  provides examples and discusses optimizations for assembly code.

- Part IV: Appendix provides extensive code examples from the GSM EFR
  vocoder.
Related Documentation From Texas Instruments


The following books describe the TMS320C62xx devices and related support 
tools. To obtain a copy of any of these TI documents, call the Texas Instru- 
ments Literature Response Center at (800) 477-8924. When ordering, please 
identify the book by its title and literature number. 


TMS320C6x Assembly Language Tools User’s Guide (literature number 
SPRU186) describes the assembly language tools (assembler, linker, 
and other tools used to develop assembly language code), assembler 
directives, macros, common object file format, and symbolic debugging 
directives for the ’C6x generation of devices. 


TMS320C6x Optimizing C Compiler User’s Guide (literature number
SPRU187) describes the ’C6x C compiler. This C compiler accepts ANSI
standard C source code and produces assembly language source code
for the ’C6x generation of devices. This book also describes the
assembly optimizer, which helps you optimize your assembly code.


TMS320C6x C Source Debugger User’s Guide (literature number 
SPRU188) tells you how to invoke the ’C6x simulator and emulator 
versions of the C source debugger interface. This book discusses 
various aspects of the debugger, including command entry, code 
execution, data management, breakpoints, profiling, and analysis. 


TMS320C62xx CPU and Instruction Set Reference Guide (literature 
number SPRU189) describes the ’C62xx CPU architecture, instruction 
set, pipeline, and interrupts for the TMS320C62xx digital signal proces- 
sors. 


TMS320 DSP Designer’s Notebook: Volume 1 (literature number
SPRT125) presents solutions to common design problems using ’C2x,
’C8x, ’C4x, ’C5x, and other TI DSPs.


TMS320C62xx Peripherals Reference Guide (literature number SPRU190) 
describes common peripherals available on the TMS320C62xx digital 
signal processors. This book includes information on the internal data 
and program memories, the external memory interface (EMIF), the host 
port, serial ports, direct memory access (DMA), clocking and phase- 
locked loop (PLL), and the power-down modes. 


TMS320C6201 Digital Signal Processor Data Sheet (literature number 
SPRS051) describes the features of the TMS320C6xx and provides pin- 
outs, electrical specifications, and timings for the device. 
Solaris and SunOS are trademarks of Sun Microsystems, Inc.
VelociTI is a trademark of Texas Instruments Incorporated.


Windows and Windows NT are registered trademarks of Microsoft 
Corporation. 
If You Need Assistance...

TI Online                                        http://www.ti.com
Semiconductor Product Information Center (PIC)   http://www.ti.com/sc/docs/pic/home.htm
DSP Solutions                                    http://www.ti.com/dsps
320 Hotline On-line™                             http://www.ti.com/sc/docs/dsps/support.htm


North America, South America, Central America

Product Information Center (PIC)           (972) 644-5580
TI Literature Response Center U.S.A.       (800) 477-8924
Software Registration/Upgrades             (214) 638-0333   Fax: (214) 638-7742
U.S.A. Factory Repair/Hardware Upgrades    (281) 274-2285
U.S. Technical Training Organization       (972) 644-5580
DSP Hotline                                (281) 274-2320   Fax: (281) 274-2324   Email: dsph@ti.com
DSP Modem BBS                              (281) 274-2323


Europe, Middle East, Africa

European Product Information Center (EPIC) Hotlines:
    Multi-Language Support   +33 1 30 70 11 69   Fax: +33 1 30 70 10 32   Email: epic@ti.com
    Deutsch                  +49 8161 80 33 11 or +33 1 30 70 11 68
    English                  +33 1 30 70 11 65


Asia-Pacific

Korea DSP Hotline            +82 2551 2804   Fax: +82 2551 2828
Singapore DSP Hotline        Fax: +65 390 7179
Product Information Center   +0120-81-0026 (in Japan)   Fax: +0120-81-0036 (in Japan)


Documentation

When making suggestions or reporting errors in documentation, please include the following information that is on the title
page: the full title of the book, the publication date, and the literature number.
Introduction


Part I 


This chapter introduces some features of the ’C62xx microprocessor and 
discusses the basic process for creating code. 
1.1 TMS320C62xx Architecture

The ’C62xx is a fixed-point digital signal processor (DSP) and is the first DSP
to use the VelociTI™ architecture. VelociTI is a high-performance, advanced
very-long-instruction-word (VLIW) architecture, making it an excellent choice
for multichannel, multifunction, and performance-driven applications.


The ’C62xx DSPs are based on the ’C62xx CPU, which consists of: 
- Program fetch unit
- Instruction dispatch unit
- Instruction decode unit
- Two data paths, each with four functional units
- Thirty-two 32-bit registers
- Control registers
- Control logic
- Test, emulation, and interrupt logic


1.2 TMS320C62xx Pipeline 


The ’C62xx pipeline has several features that provide optimum performance, 
low cost, and simple programming. 


- Increased pipelining eliminates traditional architectural bottlenecks in
  program fetch, data access, and multiply operations.

- Pipeline control is simplified by eliminating pipeline locks.

- The pipeline can dispatch eight parallel instructions every cycle.

- Parallel instructions proceed simultaneously through the same pipeline
  phases.


1.3 Code Development Flow to Increase Performance


You can achieve the best performance from your ’C62xx code if you follow this 
flow when you are writing and debugging your code: 


[Figure: Code development flow. Phase 1, Develop C Code: write C code.
Phase 2, Refine C Code: refine C code and decide whether more C
optimization is worthwhile. Phase 3, Write Linear Assembly: write linear
assembly and run the assembly optimizer. Each phase ends in a Yes/No
decision that leads either to Complete or to the next step.]


The following lists the phases in the three-step software development flow 
shown on page 1-3, and the goal for each phase: 


Phase   Goal

1       You can develop your C code for phase 1 without any knowledge of
        the ’C62xx. Use the ’C62xx profiling tools that are described in the
        TMS320C6x C Source Debugger User’s Guide to identify any
        inefficient areas that you might have in your C code. To improve the
        performance of your code, proceed to phase 2.

2       Use the intrinsics, shell options, and techniques that are described
        in Chapter 3 of this book to improve your C code. Use the ’C62xx
        profiling tools to check its performance. If your code is still not as
        efficient as you would like it to be, proceed to phase 3.

3       Extract the time-critical areas from your C code and rewrite the code
        in linear assembly. You can use the assembly optimizer to optimize
        this code.


Chapter 2 


Code Development Flow Tutorial 




This chapter walks you through the code development flow that was
introduced in Chapter 1. It uses step-by-step instructions and code examples
to show you how to use the software development tools in each phase of
development.


Before you start this tutorial, you should install the code generation tools and 
the C source debugger. If you do not have a Texas Instruments C source de- 
bugger, use your own debugger to check your results. 


The sample code that is used in this tutorial is included on the code generation 
tools CD-ROM. When you install your code generation tools, the example 
code is installed in the c6xtools directory. Use the code in that directory to go 
through the examples in this chapter. 


The examples in this chapter were run on the most recent version of the soft- 
ware development tools that were available as of the publication of this book. 
Because the tools are being continuously improved, you may get different re- 
sults if you are using a more recent version of the tools. 


2.1 Before You Begin 
This tutorial contains three basic types of information:

Primary tasks

    Primary tasks identify the main lessons in the tutorial; they are
    boxed so that you can find them easily. A primary task looks like this:

        On a command line, enter:

        load6x count.out

Important information

    In addition to primary tasks, important information ensures that the
    tutorial works correctly. Important information is marked like this:

        Important! If you are using SunOS, be sure you reinitialize your
        shell before continuing with this tutorial.

Optional tasks

    Optional tasks allow you to learn more about the ’C62xx tools;
    however, you do not need to perform the optional tasks to complete
    the tutorial successfully. Optional tasks are marked like this:

        Try This: The stand-alone simulator (load6x) is another tool that
        you can use to find out what the cycle count for each function is...

This tutorial is divided into lessons. Each lesson builds on the previous lesson.
To get the most benefit from the tutorial, you should start at the beginning and
work your way through to the end without skipping lessons or doing them out
of order.


2.2 Introduction to the Example Code

The C code example that you will use to start this tutorial is demo1.c, which
is shown in Example 2-1. This example calls three functions: mac1(),
vec_mpy1(), and iir1().


Example 2-1. The Code Example: demo1.c

    main(int argc, char *argv[])
    {
        const short coefs[150];
        short optr[150];
        short state[2];
        const short a[150];
        const short b[150];
        int c = 0;
        int dotp[1] = {0};
        int sum = 0;
        short y[150];
        short scalar = 3345;
        const short x[150];

        sum = mac1(a, b, c, dotp);      /* multiply accumulate  */
        vec_mpy1(y, x, scalar);         /* vector multiply      */
        iir1(coefs, x, optr, state);    /* IIR biquad filter    */
    }


The mac1() function, a multiply accumulate example, is shown in
Example 2-2. In a single loop it computes the dot product of two vectors and
the sum of the squares of one of them.


Example 2-2. The Multiply Accumulate Function: mac1.c

    int mac1(const short *a, const short *b, int sqr, int *sum)
    {
        int i;
        int dotp = *sum;

        for (i = 0; i < 150; i++)
        {
            dotp += b[i] * a[i];    /* dot product of a and b   */
            sqr  += b[i] * b[i];    /* sum of the squares of b  */
        }

        *sum = dotp;
        return sqr;
    }
The vec_mpy1() function shown in Example 2-3 is a vector multiply, which is
a scalar multiply followed by a right shift. The result is stored to a second
vector.


Example 2-3. The Vector Multiply Function: vec_mpy1.c

    void vec_mpy1(short y[], const short x[], short scalar)
    {
        int i;

        for (i = 0; i < 150; i++)
            y[i] += ((scalar * x[i]) >> 15);
    }


The third function, iir1(), is a typical infinite impulse response (IIR) biquad
filter. The code for this function is shown in Example 2-4.


Example 2-4. The Biquad Filter: iir1.c

    void iir1(const short *coefs, const short *input,
              short *optr, short *state)
    {
        short x;
        short t;
        int n;

        x = input[0];

        for (n = 0; n < 50; n++)
        {
            t = x + ((coefs[2] * state[0] +
                      coefs[3] * state[1]) >> 15);
            x = t + ((coefs[0] * state[0] +
                      coefs[1] * state[1]) >> 15);
            state[1] = state[0];
            state[0] = t;
            coefs += 4;    /* point to next filter coefs  */
            state += 2;    /* point to next filter states */
        }

        *optr++ = x;
    }


2.3 Lesson 1: Compiling, Assembling, and Linking the Example Code 


The first step is to compile, assemble, and link the code. 


On a command line, enter the following on a single line: 


    cl6x -g -o -k -mg demo1.c mac1.c vec_mpy1.c iir1.c
    -z lnk.cmd -l rts6201.lib -o demo1.out


You should not receive any errors, and the file, demo1.out, should be created. 
If you receive an error message, look up that error message in the appropriate 
user’s guide. 


Here is a description of what you told the shell program (cl6x) to do:

cl6x             Run the compiler and the assembler.

-g               Generate symbolic debugging directives that are used by
                 the debugger.

-o               Invoke the optimizer at the default level. (-o is the same as
                 -o2.)

                 Not all optimizations work well with debugging, because the
                 optimizer’s rearrangement of code can make it difficult for
                 you to correlate source code with object code. Using the -g
                 option with the -o option allows for the maximum amount
                 of optimization that is compatible with debugging.

-k               Keep the assembly output files. Notice that you now have
                 the following .asm files in your current directory: demo1.asm,
                 mac1.asm, vec_mpy1.asm, and iir1.asm.

                 When the -k option is not used, the shell program deletes
                 the assembly output files after assembly is complete.

-mg              Turn on the maximum amount of optimization that is
                 compatible with profiling. The -mg option allows you to profile
                 optimized code.

-z               Invoke the linker. The addition of this option to the cl6x
                 command line means that the code is compiled, assembled,
                 and linked in one step.

lnk.cmd          Use lnk.cmd as the linker command file. Linker command
                 files allow you to put linking information into a file, which is
                 useful when you invoke the linker often with the same
                 information.

                 Linker command files are also useful because they allow
                 you to use the MEMORY directive, which defines the target
                 memory configuration, and the SECTIONS directive, which
                 controls how sections are built and allocated.

-l rts6201.lib   Include the runtime-support library, rts6201.lib, which is
                 included on your CD-ROM.

                 The runtime-support functions in rts6201.lib were compiled
                 for little-endian mode. For big-endian mode, use the
                 runtime-support functions in rts6201e.lib.

-o demo1.out     Name the output file demo1.out. (The default is a.out.)

                 Because this option came after the -z option, it was
                 considered a linker option and was interpreted differently than
                 the -o option that you entered before -z.


Try This: The options above are used throughout the rest of this tutorial.
They are fairly common and might be ones that you want to use repeatedly. 
To avoid having to retype them each time you run the code development tools, 
you can use the C_OPTIONS environment variable. The shell program uses 
the default options and/or input filenames that you name with the C_OPTIONS 
environment variable every time you run the shell. 


Use the commands in Table 2—1 to set up the C_OPTIONS environment vari- 
able with the options used on page 2-5. 


Table 2-1. Using the C_OPTIONS Environment Variable

Your Setup             What to Change   Command

Windows NT™            System applet    SET C_OPTION=-g -o -k -mg *.c -z lnk.cmd -l rts6201.lib

Windows™ 95            autoexec.bat     SET C_OPTION=-g -o -k -mg *.c -z lnk.cmd -l rts6201.lib

C shell                .cshrc           setenv C_OPTION "-g -o -k -mg *.c -z lnk.cmd -l rts6201.lib"

Bourne or Korn shell   .profile         C_OPTION="-g -o -k -mg *.c -z lnk.cmd -l rts6201.lib"; export C_OPTION


Notice that the -o demo1.out linker option was not included. If it were included,
running the second tutorial example, demo2.c, would result in an output file
named demo1.out instead of a more logical name such as demo2.out.


Also, note that *.c is used. This tells the compiler to compile all of the C files 
in the current directory. Make sure that your current directory contains all of the 
C files that you want to compile and does not contain any additional C files that 
you do not want to compile. 


Important! If you are using SunOS, be sure you reinitialize your shell before
continuing with this tutorial:

- For C shells, enter the following on a command line:

      source ~/.cshrc

- For Bourne or Korn shells, enter the following on a command line:

      . ~/.profile


2.4 Lesson 2: Profiling the Example Code 


Now, use the profiler to look at the output of demo1. In this lesson, you will use 
the profiler to see the total execution time in number of cycles of each C func- 
tion in demo1.out. 


To start the profiler and load demo1.out, follow these steps: 


1) Double-click the icon for the debugger.

2) From the Profile menu, select Profile Mode.

   The debugger switches to profiling mode and displays only the
   Command, Disassembly, File, and Profile windows.

3) From the File menu, select Load Program.

   This displays the Load Program File dialog box.

4) Double-click the demo1.out file. To do so, you might need to change the
   working directory.

   This loads demo1.out into the profiler. Because the File window is
   reserved for C programs, it disappears.
To select the areas of demo1 that you want profiled, follow these steps:

1) From the Profile menu, select Select Areas.

   This displays the Profile Marking dialog box.

2) In the Level box, select C.

3) In the Area box, select Functions.

   This indicates that the C functions in demo1.out will be your profile
   areas.

4) Click Mark.

   [Figure: Profile Marking dialog box, showing the Area options (Lines,
   Ranges, Functions, All areas), the Module and Function fields, and the
   Unmark, Enable, Disable, Close, and Help buttons.]

5) Click Close.

   The Profile window is updated to include a line for each C function in
   demo1.
To start the profiling session, follow these steps:

1) Click the run icon on the toolbar.

   This displays the Profile Run dialog box.

2) In the Run Method box, select Quick, no exclusive fields. This will show
   you the total execution time (cycle count) of a profile area, including the
   execution time of any subroutines called within the functions.

3) If main() is not already selected as your starting point, choose it from
   the list of starting points.

   [Figure: Profile Run dialog box, showing the Run Method options (Full,
   all fields; Quick, no exclusive), the Display Rate slider (Often to Never),
   the Start Point list with main selected, and the Cancel and Help buttons.]

4) Click OK.

   The Run Method dialog box closes and the status bar reads Target:
   Profiling to indicate that the profiling session has started.
The program restarts and runs to main() without profiling. Profiling begins
when main() is reached and continues until the exit point of main() is reached.
When profiling is complete, the status bar reads Target: Halted and your Profile
window looks like this:

[Figure: Profile window with the columns Area Name, Count, Inclusive, and
Incl-Max, and one C Function row each for iir1(), mac1(), main(), and
vec_mpy1().]

The Inclusive column indicates the cycle counts for each function, including 
any function that it calls. Because these functions do not call any other func- 
tions, the inclusive cycle counts are the same as the exclusive cycle counts. 
Notice that the cycle count for the mac1() function is 167, and that the cycle 
counts for the vec_mpy1() and iir1() functions are much higher—316 and 
270, respectively. 


To interpret the cycle counts in the Profile window, you need to understand how 
they are calculated. Here is the formula for calculating cycle counts: 


Execute packets x loop iterations in C code + constant 


An execute packet is a group of parallel instructions. You can have up to eight 
instructions executing in parallel; therefore, each execute packet can contain 
up to eight instructions. Execute packets are covered in more detail on 
page 2-14. 


Table 2—2 shows how the cycle counts were calculated for each function. 


Table 2-2. Cycle Counts

Function      Execute Packets   Loop Iterations   Constant   Cycle Count

mac1()        1                 150               17         1 x 150 + 17 = 167
vec_mpy1()    2                 150               16         2 x 150 + 16 = 316
iir1()        5                 50                20         5 x 50 + 20 = 270
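
The cycle-count formula can be checked against the profiler figures with a few lines of C. This is only a sketch: the function name cycle_count and its parameter names are made up for illustration and are not part of the ’C62xx tools.

```c
/* Hypothetical helper (not part of the TI tools): estimated cycle count of
 * a software-pipelined loop, following the formula
 *     execute packets x loop iterations in C code + constant            */
int cycle_count(int execute_packets, int loop_iterations, int constant)
{
    return execute_packets * loop_iterations + constant;
}
```

For example, cycle_count(2, 150, 16) reproduces the 316 cycles that the profiler reports for vec_mpy1().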
Try This: The stand-alone simulator (load6x) is another tool that you can use
to find out what the cycle count for each function is. To get cycle count
information for each function with the stand-alone simulator, embed the clock()
function in your C code. Example 2-5 shows how to rewrite demo1.c to include
the clock() function.


Example 2-5. Including the clock() Function in demo1.c (count.c)

    #include <stdio.h>
    #include <time.h>

    main(int argc, char *argv[])
    {
        const short coefs[150];
        short optr[150];
        short state[2];
        const short a[150];
        const short b[150];
        int c = 0;
        int dotp[1] = {0};
        int sum = 0;
        short y[150];
        short scalar = 3345;
        const short x[150];
        clock_t start, stop, overhead;

        /* measure the cost of calling clock() itself */
        start = clock();
        stop = clock();
        overhead = stop - start;

        start = clock();
        sum = mac1(a, b, c, dotp);
        stop = clock();
        printf("mac1 cycles: %d\n", stop - start - overhead);

        start = clock();
        vec_mpy1(y, x, scalar);
        stop = clock();
        printf("vec_mpy1 cycles: %d\n", stop - start - overhead);

        start = clock();
        iir1(coefs, x, optr, state);
        stop = clock();
        printf("iir1 cycles: %d\n", stop - start - overhead);
    }


Note:

When using this method, remember to calculate the overhead and include
the appropriate header files.
Now, compile, assemble, and link count.c.

If you did not set up your C_OPTIONS environment variable as described
on page 2-6, enter the following on a command line:

    cl6x -g -o -k -mg count.c mac1.c vec_mpy1.c iir1.c
    -z lnk.cmd -l rts6201.lib -o count.out

OR

If you set up your C_OPTIONS environment variable as described on
page 2-6, enter the following on a command line:

    cl6x -z -o count.out

Although the -z option is already specified in the C_OPTIONS environment
variable, you need to specify it on the command line to indicate that this
occurrence of -o is a linker option.


Use load6x to see the output of the printf statements that were embedded in 
the C code. 


On a command line, enter: 


load6x count.out 


You should see the following output: 


    TMS320C6x C I/O COFF Loader Version 1.01
    Copyright (c) 1989-1997 Texas Instruments Incorporated
    Interrupt to abort
    mac1 cycles: 175
    vec_mpy1 cycles: 324
    iir1 cycles: 278
    NORMAL COMPLETION: 20949 cycles


Notice that these cycle counts are higher than the cycle counts that you saw
with the profiler. For example, mac1() is listed here as having 175 cycles;
however, it was listed in the Profile window as having 167 cycles. You will see
some extra cycles when you use load6x because you still have overhead for each
function call. When you use the profiler, the cycles needed for calling the
functions are not included in the profile display.


The Using the Standalone Simulator chapter in the TMS320C6x Optimizing 
C Compiler User’s Guide discusses load6x in more detail. 
2.5 Lesson 3: Phase 1 of the Code Development Flow


Looking at the functions in demo1 one at a time, you can determine whether 
or not they need to be improved and, if they do need to be improved, how they 
can be improved. Start by looking at the first function, mac1( ). 


Example 2-6 shows the assembly output of the function’s inner loop kernel. 
The loop kernel is the area of the loop with the most parallelism. Only the inner 
loop is shown, because this is the area that can be improved with software pi- 
pelining. Notice that there are eight instructions executing in parallel (as indi- 
cated by the seven sets of parallel bars). This is the maximum number of 
instructions that the ’C62xx can execute in parallel, so this code does not need 
to be improved. 


Example 2-6. Inner Loop Kernel of mac1.asm

    L3:        ; PIPED LOOP KERNEL
               ADD    .L2    B4,B7,B7     ;
    ||         ADD    .L1    A5,A3,A3     ;
    ||         MPY    .M2X   A4,B5,B4     ;@@
    ||         MPY    .M1    A4,A4,A5     ;@@
    || [ B0]   B      .S1    L3           ;@@@@@
    || [ B0]   SUB    .S2    B0,1,B0      ;@@@@@@
    ||         LDH    .D1    *A0++,A4     ;@@@@@@@
    ||         LDH    .D2    *B6++,B5     ;@@@@@@@


The @ characters specify the iteration of the loop that an instruction is on in
the software pipeline; these symbols are automatically created by the code
generation tools. One @ character represents the first iteration, two @
characters represent the second iteration, and so on.


Because the mac1( ) function does not need to be improved, it does not need 
to go beyond phase 1 of the code development flow. 
Look at Example 2-7, which shows the assembly output of the innermost loop
for the vec_mpy1() function. Recall from page 2-10 that the vec_mpy1()
function took 316 cycles to execute. This code is not as parallel as the mac1()
function. The assembly output for the vec_mpy1() function shows two execute
packets. Each execute packet has four parallel instructions. This loop can be
improved.


Example 2-7. Inner Loop Kernel of vec_mpy1.asm

    L3:        ; PIPED LOOP KERNEL
               ADD    .L2X   A3,B6,B5     ;
    || [ A1]   B      .S1    L3           ;@@
    ||         LDH    .D2    *+B4(6),B6   ;@@@
    ||         LDH    .D1    *A0++,A4     ;@@@@

               STH    .D2    B5,*B4++     ;
    ||         SHR    .S1    A3,15,A3     ;@
    ||         MPY    .M1    A5,A4,A3     ;@@
    || [ A1]   SUB    .L1    A1,1,A1      ;@@@
Example 2-8 shows the assembly output of the innermost loop for the iir1()
function. Recall from page 2-10 that the iir1() function took 270 cycles to
execute. As you can see, some execute packets have five parallel instructions,
while others have as few as four parallel instructions, which indicates that the
code can probably be improved.


Example 2-8. Inner Loop Kernel of iir1.asm

    L3:        ; PIPED LOOP KERNEL
               SHR    .S2    B4,15,B4       ;
               SHR    .S1    A3,15,A5       ;
               MPY    .M2X   B6,A5,B6       ;@
               LDH    .D1    *+A6(16),A4    ;@@@
               LDH    .D2    *+B7(10),B6    ;@@
               ADD    .L1    A0,A5,A0       ;
               MPY    .M1X   B6,A3,A3       ;@
               MPY    .M2X   B5,A4,B5       ;@
               LDH    .D1    *+A6(22),A3    ;@@
               LDH    .D2    *+B7(8),B5     ;@@
               EXT    .S1    A0,16,16,A0    ;
               STH    .D2    B5,*+B7(6)     ;@
               MPY    .M1X   B5,A3,A4       ;@
               LDH    .D1    *+A6(20),A3    ;@@
               ADD    .S1    8,A6,A6        ;
               STH    .D2    A0,*B7++(4)    ;
               ADD    .L1X   A0,B4,A0       ;
       [ B0]   SUB    .L2    B0,1,B0        ;@
               ADD    .S2    B6,B5,B4       ;@
               EXT    .S1    A0,16,16,A0    ;
       [ B0]   B      .S2    L3             ;@
               ADD    .L1    A3,A4,A3       ;@
               LDH    .D1    *+A6(18),A5    ;@@@


To improve the vec_mpy1() and iir1() functions, start by seeing how you can
refine and improve your C code. This is what is referred to as phase 2 of the code
development flow, and this is what the next lesson is about.
2.6 Lesson 4: Phase 2 of the Code Development Flow


For your convenience, the vec_mpy1() function is duplicated here as 
Example 2-9 (the C version) and Example 2—10 (the assembly output of the 
inner loop). This is the same code that you saw in Example 2-3 and 
Example 2-7. 


Example 2-9. The Vector Multiply Function: vec_mpy1.c

    void vec_mpy1(short y[], const short x[], short scalar)
    {
        int i;

        for (i = 0; i < 150; i++)
            y[i] += ((scalar * x[i]) >> 15);
    }


Example 2—9 uses short data types. Because short data types are 16 bits, they 
translate into halfword instructions, such as LDH and STH (see 
Example 2-10). 


The loop in Example 2-10 uses two LDH instructions and an STH instruction
to load x[i] and y[i] and store back to y[i]. Because only two memory operations
can occur per cycle, the fastest that this loop can execute is one y[i] result
every two cycles. This loop is limited by the number of D units.
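
The two-cycle bound follows from simple arithmetic: three memory operations per iteration shared between two D units. The sketch below spells that out; the names NUM_D_UNITS and min_cycles_per_iteration are hypothetical, and the calculation considers only the D units, ignoring every other resource and dependency.

```c
/* The 'C62xx has two .D units, so at most two loads/stores issue per cycle. */
#define NUM_D_UNITS 2

/* Hypothetical helper: lower bound on cycles per loop iteration imposed by
 * the memory operations alone. */
int min_cycles_per_iteration(int loads, int stores)
{
    int memory_ops = loads + stores;

    /* Round up: a cycle that uses only one D unit still costs a full cycle. */
    return (memory_ops + NUM_D_UNITS - 1) / NUM_D_UNITS;
}
```

With two LDH instructions and one STH, min_cycles_per_iteration(2, 1) is 2, which matches the two execute packets in the loop kernel of vec_mpy1.asm.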


Example 2-10. Inner Loop Kernel of vec_mpy1.asm

    L3:        ; PIPED LOOP KERNEL
               ADD    .L2X   A3,B6,B5     ;
    || [ A1]   B      .S1    L3           ;@@
    ||         LDH    .D2    *+B4(6),B6   ;@@@
    ||         LDH    .D1    *A0++,A4     ;@@@@

               STH    .D2    B5,*B4++     ;
    ||         SHR    .S1    A3,15,A3     ;@
    ||         MPY    .M1    A5,A4,A3     ;@@
    || [ A1]   SUB    .L1    A1,1,A1      ;@@@


Because x is an array, x[i] and x[i + 1] are next to each other in memory. This
means that instead of using halfword instructions (LDH and STH) to load and
store each element in the array, you can use word instructions (LDW and STW)
to load and store two elements at a time, as long as the data is aligned on a
word boundary. In other words, all word accesses should have the 2 LSBs of
the address set to zero. Two elements at a time, x[i] and x[i + 1], fit into one
32-bit register.
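
The alignment rule and the register layout can be illustrated on a host machine. The sketch below is not taken from the TI tools: is_word_aligned and pack_halfwords are hypothetical names, and the layout shown (first element in the 16 LSBs) assumes little-endian mode.

```c
#include <stdint.h>

/* Returns 1 if the address is usable for a word (32-bit) access, that is,
 * if its 2 least significant bits are zero; returns 0 otherwise. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}

/* Hypothetical helper: the 32-bit value that a word load would return for
 * two adjacent 16-bit elements in little-endian mode. The first element
 * lands in the 16 LSBs and the second element in the 16 MSBs. */
uint32_t pack_halfwords(int16_t first, int16_t second)
{
    return ((uint32_t)(uint16_t)second << 16) | (uint16_t)first;
}
```

For instance, pack_halfwords(0x1234, 0x5678) gives 0x56781234, which is how x[i] = 0x1234 and x[i + 1] = 0x5678 would sit in one register under the stated assumption.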


To achieve this in C, declare x[ ] as an integer instead of as a short data type.
Also, you need to use some intrinsics.


Now that you have determined that you can load x[i] and x[i + 1] into the same
register, you need to figure out how to do it. You can do this by using the _mpy
and _mpylh intrinsics. Intrinsics are like built-in C functions that correspond to
’C62xx assembly language instructions. The _mpy intrinsic multiplies the
16 LSBs of one operand by the 16 LSBs of another and returns the result. The
_mpylh intrinsic multiplies the 16 LSBs of the first operand by the 16 MSBs of
the second and returns the result.


You can then use the _add2 intrinsic to add the 16 MSBs of the first operand
to the 16 MSBs of the second operand. At the same time, the _add2 intrinsic
also adds the 16 LSBs of the first operand to the 16 LSBs of the second
operand. The result of both additions is stored in a 32-bit operand.
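
To experiment with these operations away from the target, they can be modeled in portable C. The functions below are host-side stand-ins written for illustration only: the emu_ names and sext16 are hypothetical, and on the ’C62xx you would call _mpy, _mpylh, and _add2 directly. The model assumes signed 16-bit multiplies and that _add2 discards the carry out of the lower halves.

```c
#include <stdint.h>

/* Sign-extend the 16 LSBs of v to a 32-bit signed value. */
static int32_t sext16(uint32_t v)
{
    v &= 0xffffu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Model of _mpy: 16 LSBs of a times 16 LSBs of b. */
int32_t emu_mpy(uint32_t a, uint32_t b)
{
    return sext16(a) * sext16(b);
}

/* Model of _mpylh: 16 LSBs of a times 16 MSBs of b. */
int32_t emu_mpylh(uint32_t a, uint32_t b)
{
    return sext16(a) * sext16(b >> 16);
}

/* Model of _add2: the upper halves and the lower halves are added
 * independently; a carry out of the lower sum is dropped. */
uint32_t emu_add2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000ffffu;
    uint32_t hi = (a & 0xffff0000u) + (b & 0xffff0000u);

    return hi | lo;
}
```

As an example of the split addition, emu_add2(0x0001ffff, 0x00010001) yields 0x00020000, whereas an ordinary 32-bit add would give 0x00030000.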


          MSBs    LSBs      (first operand)
       +  MSBs    LSBs      (second operand)
       -----------------
          MSBs    LSBs      (32-bit result)


Example 2-11 shows how to rewrite the vec_mpy() function to include the 
_mpy and _mpylh intrinsics: 


Example 2-11. The Revised Vector Multiply Function: vec_mpy2.c

    void vec_mpy2(int y[], const int x[], short scalar)
    {
        int i, val;
        unsigned int temph, templ;

        for (i = 0; i < 75; i++)
        {
            val = x[i];                 /* two 16-bit elements per word */
            templ = (_mpy(scalar, val) >> 15) & 0x0000ffff;
            temph = (_mpylh(scalar, val) << 1) & 0xffff0000;
            y[i] = _add2(y[i], temph | templ);
        }
    }
Now, look at the iir1() function. Example 2-12 shows the same code that you
saw in Example 2-4.


Example 2-12. The Biquad Filter: iir1.c

    void iir1(const short *coefs, const short *input,
              short *optr, short *state)
    {
        short x;
        short t;
        int n;

        x = input[0];

        for (n = 0; n < 50; n++)
        {
            t = x + ((coefs[2] * state[0] +
                      coefs[3] * state[1]) >> 15);
            x = t + ((coefs[0] * state[0] +
                      coefs[1] * state[1]) >> 15);
            state[1] = state[0];
            state[0] = t;
            coefs += 4;    /* point to next filter coefs  */
            state += 2;    /* point to next filter states */
        }

        *optr++ = x;
    }
You can improve the iir1() function by using the same methods that you used
to improve the vec_mpy1() function. Example 2-13 shows how to rewrite the
iir1() function:


Example 2-13. The Revised Biquad Filter: iir2.c

    void iir2(const int *coefs, const short *input,
              short *optr, short *state)
    {
        short x;
        short t;
        int n;

        x = input[0];

        for (n = 0; n < 50; n++)
        {
            t = x + ((_mpy(coefs[1], state[0]) +
                      _mpyhl(coefs[1], state[1])) >> 15);
            x = t + ((_mpy(coefs[0], state[0]) +
                      _mpyhl(coefs[0], state[1])) >> 15);
            state[1] = state[0];
            state[0] = t;
            coefs += 2;    /* two coefficients per 32-bit word */
            state += 2;
        }

        *optr++ = x;
    }
Using demo2.c, shown in Example 2-14, run the revised functions through the
compiler, assembler, and linker.


Example 2-14. The Revised Example—demo2.c


main(int argc, char *argv[])
{
    const short coefs[100];
    short optr[100];
    short state[2];
    const short a[100];
    const short b[100];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[100];
    short scalar = 3345;
    const short x[100];

    sum = mac1(a, b, c, dotp);
    vec_mpy2(y, x, scalar);
    iir2(coefs, x, optr, state);
}
If you did not set up your C_OPTIONS environment variable as described
on page 2-6, enter the following on a command line:


cl6x -g -o -k -mg demo2.c mac1.c vec_mpy2.c iir2.c
-z lnk.cmd -l rts6201.lib -o demo2.out


OR


If you set up your C_OPTIONS environment variable as described on
page 2-6, enter the following on a command line:


cl6x -z -o demo2.out


Although the -z option is already specified in the C_OPTIONS environment
variable, you need to specify it on the command line to indicate that this
occurrence of -o is a linker option.


The inner loop of the vec_mpy2( ) function translates into the following assem- 
bly output: 


Example 2-15. Inner Loop Kernel of vec_mpy2.asm 


L3:        ; PIPED LOOP KERNEL

           OR      .L2X    B5,A8,B7        ;@
           SHL     .S1     A6,1,A4         ;@@
   [ A1]   B       .S2     L3              ;@@
           AND     .L1     A5,A4,A6        ;@@
           LDW     .D2     *+B4(12),B5     ;@@@
           MPYLH   .M1     A0,A9,A6        ;@@@
           LDW     .D1     *A3++,A9        ;@@@@
           STW     .D2     B6,*B4++        ;
           ADD2    .S2     B5,B7,B6        ;@
           AND     .L1     A7,A4,A8        ;@@
           MV      .L2X    A6,B5           ;@@
   [ A1]   SUB     .D1     A1,1,A1         ;@@@
           SHR     .S1     A8,15,A4        ;@@@
           MPY     .M1     A0,A9,A8        ;@@@@


As you can see, the code for the vec_mpy2( ) function is improved over the
original vec_mpy( ) code. Two LDW instructions load four elements
(x[i], x[i+1], y[i], and y[i+1]), and one STW instruction stores two elements:
y[i] and y[i+1]. With the revised code, two y[i] results are stored every two
cycles. Recall that only one y[i] result was stored every two cycles in
Example 2-10.

Table 2-3 shows how the vec_mpy( ) function has improved as it moved from
phase 1 to phase 2.


Table 2-3. Revised Cycle Counts for vec_mpy( )


Function       Execute Packets   Loop Iterations   Constant   Cycle Count
vec_mpy1( )    2                 150               16         2 x 150 + 16 = 316
vec_mpy2( )    2                 75                22         2 x 75 + 22 = 172
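The cycle counts in these tables all follow the same arithmetic: execute packets in the loop kernel, times loop iterations, plus a constant overhead. As a small sketch (the function names cycle_count and check_cycle_counts are hypothetical, chosen only for illustration), the tutorial's figures can be reproduced as follows:

```c
#include <assert.h>

/* Cycle count = execute packets per iteration * iterations + constant.
   The constant covers the cycles spent outside the loop kernel. */
unsigned cycle_count(unsigned execute_packets, unsigned iterations,
                     unsigned constant)
{
    return execute_packets * iterations + constant;
}

/* Reproduce the vec_mpy rows of the table above. */
void check_cycle_counts(void)
{
    assert(cycle_count(2, 150, 16) == 316);   /* vec_mpy1( ) */
    assert(cycle_count(2, 75, 22) == 172);    /* vec_mpy2( ) */
}
```

Halving the iteration count from 150 to 75 is what produces the gain for vec_mpy2( ); the kernel is still two execute packets long, and the constant grows slightly.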


Now, look at the inner loop of the third function, iir( ). Example 2-16 shows the 
assembly output of the innermost loop for the revised iir( ) function: 


Example 2-16. Inner Loop Kernel of iir2.asm 


L3:        ; PIPED LOOP KERNEL
           ADD             B7,B8,B7        ;
           ADD             A0,A3,A0        ;
           MV              B6,B9           ;@
           STH     .D1     A5,*+A4(6)      ;@
           LDW     .D2     *B5++(8),B8     ;@@
           SHR     .S2     B7,15,B7        ;
           EXT     .S1     A0,16,16,A0     ;
   [ B0]   SUB             B0,1,B0         ;@
           MPY     .M2X    B8,A5,B8        ;@
           ADD     .L1X    B6,A3,A3        ;@
           LDH     .D2     *+B4(14),B6     ;@@@
           ADD     .L1X    A0,B7,A6        ;
           MPYHL   .M2     B8,B9,B7        ;@
           SHR     .S1     A3,15,A3        ;@
   [ B0]   B       .S2     L3              ;@
           LDW     .D2     *+B5(4),B7      ;@@@
           LDH     .D1     *+A4(12),A5     ;@@@
           ADD             4,B4,B4         ;
           STH     .D1     A0,*A4++(4)     ;
           EXT     .S1     A6,16,16,A0     ;
           MPYHL   .M2     B7,B6,B6        ;@@
           MPY     .M1X    B7,A5,A3        ;@@
Table 2-4 shows how the iir( ) function has improved. Now, the code has only
four execute packets; however, each packet has only about five parallel
instructions, which could probably be improved.


Table 2-4. Revised Cycle Counts for iir( )


Function    Execute Packets   Loop Iterations   Constant   Cycle Count
iir1( )     5                 50                20         5 x 50 + 20 = 270
iir2( )     4                 50                20         4 x 50 + 20 = 220


Use the profiler to view the cycle counts of the revised functions. 


Your profile window should look like this: 


Profile
Area Name                 Count   Inclusive   Incl-Max
C Function vec_mpy2()     1       172         172
C Function iir2()         1       220         220
C Function mac1()         1       167         167
C Function main()         1       637         637


Notice that the cycle count for the second function, the vector multiply, is down 
from 316 to 172. The IIR filter has improved also: it is down from 270 to 220. 
However, the cycle count for the IIR filter is still too high. Naturally, the cycle 
count for main( ) has decreased also. It is down from 831 to 637. 


Table 2-5. Revised Cycle Counts


Function       Execute Packets   Loop Iterations   Constant   Cycle Count
mac1( )†       1                 150               17         1 x 150 + 17 = 167
vec_mpy2( )    2                 75                22         2 x 75 + 22 = 172
iir2( )        4                 50                20         4 x 50 + 20 = 220

† The cycle count for the mac1( ) function has not changed.


You have done everything you can to refine the C code in the iir( ) function. To
improve your code at this point, you need to use the assembly optimizer. This
leads you to phase 3 of the code development flow.
2.7 Lesson 5: Phase 3 of the Code Development Flow

To further improve the iir( ) function, you will need to rewrite it in linear
assembly. Linear assembly is the input for the assembly optimizer.


Linear assembly is similar to regular ’C62xx assembly code in that you use
’C62xx instructions to write your code. With linear assembly, however, you do
not need to specify all of the information that you need to specify in regular
’C62xx assembly code. With linear assembly code, you have the option of
specifying the information or letting the assembly optimizer specify it for you.
Here is the information that you do not need to specify in linear assembly code:


- Parallel instructions
- Pipeline latency
- Register usage
- Which functional unit is being used


If you choose not to specify these things, the assembly optimizer determines 
the information that you do not include, based on the information that it has 
about your code. As with other code generation tools, you might need to modify 
your linear assembly code until you are satisfied with its performance. When 
you do this, you will probably want to add more detail to your linear assembly. 
For example, you might want to specify which functional unit should be used. 


Before you use the assembly optimizer, you need to know the following things 
about how it works: 


- A linear assembly file must be specified with a .sa extension.

- Linear assembly code should include the .cproc and .endproc directives.
  The .cproc and .endproc directives delimit a section of your code that you
  want the assembly optimizer to optimize. Use .cproc at the beginning of
  the section and .endproc at the end of the section. In this way, you can set
  off sections of your assembly code that you want to be optimized, like
  procedures or functions.

- Linear assembly code may include a .reg directive. The .reg directive
  allows you to use descriptive names for values that will be stored in
  registers. When you use .reg, the assembly optimizer chooses a register whose
  use agrees with the functional units chosen for the instructions that
  operate on the value.

- Linear assembly code may include a .trip directive. The .trip directive
  specifies the value of the trip count. The trip count indicates how many
  times a loop will iterate.


Now that you have some information about the fundamentals of linear assem- 
bly code, look at the revised C code for the biquad filter again. Example 2-17 
shows the same code that you saw in Example 2-13. 
Example 2-17. The Revised Biquad Filter—iir2.c


void iir2(const int *coefs, const short *input,
          short *optr, short *state)
{
    short x;
    short t;
    int n;

    x = input[0];

    for (n = 0; n < 50; n++)
    {
        t = x + ((_mpy(coefs[1], state[0]) +
                  _mpyhl(coefs[1], state[1])) >> 15);
        x = t + ((_mpy(coefs[0], state[0]) +
                  _mpyhl(coefs[0], state[1])) >> 15);
        state[1] = state[0];
        state[0] = t;
        coefs += 2;
        state += 2;
    }

    *optr++ = x;
}

Example 2-18 shows how to rewrite the iir( ) function in linear assembly. 
Example 2-18. The Biquad Filter, Revised and Assembly-Optimized—iir3.sa


        .def    _iir3
_iir3:  .cproc  cptr0,sptr0
        .reg    cptr1, s01, s10, s23, c10, c32, sptr1, s10p, ctr
        .reg    p0, p1, p2, p3, s23_s, s1, t, s10_s, x, mask, s10_t

        MV      .2      cptr0,cptr1
        MV      .1      sptr0,sptr1
        MVK             50,ctr          ; setup loop counter

LOOP:   .trip 50
        LDW     .D1T1   *cptr0,c32      ; CoefAddr[3] & CoefAddr[2]
        LDW     .D2T2   *cptr1,c10      ; CoefAddr[1] & CoefAddr[0]
        LDW     .D1T2   *sptr0,s10      ; StateAddr[1] & StateAddr[0]
        MV      .2      s10,s10p        ; save StateAddr[1] & StateAddr[0]

        MPY     .M1     c32,s10,p2      ; CoefAddr[2] * StateAddr[0]
        MPYH    .M1     c32,s10,p3      ; CoefAddr[3] * StateAddr[1]
        ADD     .1      p2,p3,s23       ; CA[2] * SA[0] + CA[3] * SA[1]
        SHR     .1      s23,15,s23_s    ; (CA[2] * SA[0] + CA[3] * SA[1]) >> 15
        ADD     .2      s23_s,x,t       ; t = x+((CA[2]*SA[0]+CA[3]*SA[1])>>15)
        AND     .2      t,mask,t        ; clear upper 16 bits

        MPY     .M2     c10,s10,p0      ; CoefAddr[0] * StateAddr[0]
        MPYH    .M2     c10,s10,p1      ; CoefAddr[1] * StateAddr[1]
        ADD     .2      p0,p1,s10_t     ; CA[0] * SA[0] + CA[1] * SA[1]
        SHR     .2      s10_t,15,s10_s  ; (CA[0] * SA[0] + CA[1] * SA[1]) >> 15
        ADD     .2      s10_s,t,x       ; x = t+((CA[0]*SA[0]+CA[1]*SA[1])>>15)

        SHL     .2      s10p,16,s1      ; StateAddr[1] = StateAddr[0]
        OR      .2      t,s1,s01        ; StateAddr[0] = t
        STW     .D1     s01,*sptr1      ; store StateAddr[1] & StateAddr[0]

        ADD     .1      -1,ctr,ctr      ; dec outer lp cntr
[ctr]   B       .S1     LOOP            ; Branch outer loop
        .endproc



Using demo3.c, shown in Example 2-19, run the revised functions through the
code generation tools.


Example 2-19. The Revised Example—demo3.c


main(int argc, char *argv[])
{
    const short coefs[150];
    short optr[150];
    short state[2];
    const short a[150];
    const short b[150];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[150];
    short scalar = 3345;
    const short x[150];

    sum = mac1(a, b, c, dotp);
    vec_mpy2(y, x, scalar);
    iir3(coefs, x, optr, state);
}


Use the shell program (cl6x) to compile, assemble, and link. Be sure you use
the -mg option. The -mg option ensures that the optimizations that are used
are compatible with profiling.


On a command line, enter:


cl6x -g -o -k -mg demo3.c mac1.c vec_mpy2.c iir3.sa
-z lnk.cmd -l rts6201.lib -o demo3.out


Notice that you used the shell program to compile a linear assembly file and
a C file at the same time. Also notice that (except for the -mg option) you used
the same options that you used in the first part of this tutorial. The assembly
optimizer has a small set of unique options, but many of the options that
you will use are shell options that apply to either linear assembly files or C files.
Example 2-20. Inner Loop Kernel of iir3.asm


L3:        ; PIPED LOOP KERNEL
           AND     .L2     B3,B7,B0           ;     clear upper 16 bits
           ADD     .S2     B0,B8,B8           ;@    CA[0] * SA[0] + CA[1] * SA[1]
   [ A1]   B       .S1     L3                 ;@    Branch outer loop
           ADD     .L1     A4,A5,A4           ;@    CA[2] * SA[0] + CA[3] * SA[1]
           MPYH    .M2     B2,B1,B8           ;@@   CoefAddr[1] * StateAddr[1]
           MPY     .M1X    A0,B1,A4           ;@@   CoefAddr[2] * StateAddr[0]
           LDW     .D2     *B6,B2             ;@@@@ CoefAddr[1] & CoefAddr[0]
           LDW     .D1     *A3,A0             ;@@@@ CoefAddr[3] & CoefAddr[2]
           ADD     .D2     B4,B0,B9           ;     x = t+((CA[0]*SA[0]+CA[1]*SA[1])>>15)
           OR      .L2     B0,B9,B0           ;     StateAddr[0] = t
           SHR     .S2     B8,0xf,B4          ;@    (CA[0] * SA[0] + CA[1] * SA[1]) >> 15
           SHR     .S1     A4,0xf,A5          ;@    (CA[2] * SA[0] + CA[3] * SA[1]) >> 15
           MPY     .M2     B2,B1,B0           ;@@   CoefAddr[0] * StateAddr[0]
           MPYH    .M1X    A0,B1,A5           ;@@   CoefAddr[3] * StateAddr[1]
           LDW     .D1     *A6,B1             ;@@@@ StateAddr[1] & StateAddr[0]
           STW     .D1     B0,*A7             ;     store StateAddr[1] & StateAddr[0]
           SHL     .S2     B5,0x10,B9         ;@    StateAddr[1] = StateAddr[0]
           ADD     .L2X    B9,A5,B3           ;@    t = x+((CA[2]*SA[0]+CA[3]*SA[1])>>15)
   [ A1]   ADD     .L1     0xffffffff,A1,A1   ;@@   dec outer lp cntr
           MV      .D2     B1,B5              ;@@   save StateAddr[1] & StateAddr[0]


Table 2-6 shows how the iir( ) function has improved as it has moved through
the three phases of code development.


Table 2-6. Revised Cycle Counts for iir( )


Function    Execute Packets   Loop Iterations   Constant   Cycle Count
iir1( )     5                 50                20         5 x 50 + 20 = 270
iir2( )     4                 50                20         4 x 50 + 20 = 220
iir3( )     3                 50                27         3 x 50 + 27 = 177
Use the profiler to view the cycle counts of the revised functions.


Your profile window should look like this:


Profile
Area Name                 Count   Inclusive   Incl-Max
C Function vec_mpy2()     1       172         172
C Function iir3()         1       177         177
C Function mac1()         1       167         167
C Function main()         1       594         594


Notice that the cycle count for the IIR filter has improved: it is down from 220
to 177. Naturally, the cycle count for main( ) has decreased also. It is down
from 637 to 594.


Table 2-7. Revised Cycle Counts


Function        Execute Packets   Loop Iterations   Constant   Cycle Count
mac1( )†        1                 150               17         1 x 150 + 17 = 167
vec_mpy2( )†    2                 75                22         2 x 75 + 22 = 172
iir3( )         3                 50                27         3 x 50 + 27 = 177

† The cycle counts for the mac1( ) function and the vec_mpy2( ) function have not changed.


The Using the Assembly Optimizer chapter in the TMS320C6x Optimizing C 
Compiler User’s Guide discusses the assembly optimizer in more detail. 
2.8 Summary

Congratulations! In this tutorial, you learned the following things:


- The three phases of code development, how to determine which phases
  are appropriate for improving different parts of your code, and how to write
  your code for each phase.

- What a linear assembly file is and some fundamental information on how
  to write one.

- How to use the code generation tools to compile, assemble, and link your
  C and linear assembly files.

- How to use the profiler to analyze your results and determine whether or
  not you need to continue refining your code.




Chapter 3 


Optimizing C Code 


You can maximize C performance by using compiler options, intrinsics, and
code transformations. This chapter discusses the following topics:


- The compiler and its options
- Intrinsics
- Software pipelining
- Loop unrolling

Topic

3.1 Writing C Code
3.2 Compiling C Code
3.3 Refining C Code

3.1 Writing C Code


This chapter shows you how to analyze and tailor your code to be sure you are
getting the best performance from the ’C6x architecture.


3.1.1 Tips on Data Types


Give careful consideration to the data type size when writing your code. The
’C6x compiler defines a size for each data type (signed and unsigned):


- char     8 bits
- short    16 bits
- int      32 bits
- long     40 bits

Based on the size of each data type, follow these guidelines when writing C
code:

- Avoid code that assumes that int and long types are the same size,
  because the ’C6x compiler uses long values for 40-bit operations.

- Use the short data type for multiplication inputs whenever possible
  because this data type provides the most efficient use of the 16-bit
  multiplier in the ’C62xx.

- Use int or unsigned int types for loop counters, rather than short or
  unsigned short data types, to avoid unnecessary sign-extension instructions.
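The guidelines above can be illustrated with a short multiply-accumulate loop. This is a sketch only: the function name dot16 is hypothetical, and the code shows the type choices (short multiplier inputs, an int loop counter, an int accumulator) and not a tuned ’C62xx implementation.

```c
#include <assert.h>

/* Multiply-accumulate with 16-bit inputs and 32-bit bookkeeping.
   - a and b are short, so each product is a 16-bit x 16-bit multiply.
   - i is an int, so the counter needs no sign extension.
   - sum is an int; the guideline against mixing int and long applies
     because long is 40 bits on the 'C6x, unlike on most host compilers. */
int dot16(const short *a, const short *b, int n)
{
    int i;
    int sum = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

For example, with a = {1, 2, 3} and b = {4, 5, 6}, dot16(a, b, 3) returns 1*4 + 2*5 + 3*6 = 32.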


3.1.2 Analyzing C Code Performance

Use the following techniques to analyze the performance of specific code
regions:


- One of the preliminary measures of code is the time it takes the code to
  run. Use the clock( ) and printf( ) functions in C to time and display the
  performance of specific code regions. You can use the stand-alone
  simulator (load6x) to run the code for this purpose.

- Use the profile mode in the debugger, as explained in the TMS320C6x
  C Source Debugger User’s Guide, to collect execution statistics about
  specific areas in your code.

- Use breakpoints, the clk register, and the RUNB command in the
  debugger, as described in the TMS320C6x C Source Debugger User’s
  Guide, to track the number of CPU clock cycles consumed by a particular
  section of code.

- The critical performance areas in your code are most often loops. The
  easiest way to optimize a loop is by extracting it into a separate file that
  can be rewritten, recompiled, and run stand-alone.


As you use the techniques described in this chapter to optimize your C code,
you can then evaluate the performance results by running the code and
looking at the instructions generated by the compiler.
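The first technique, timing a region with clock( ) and printf( ), can be sketched with standard C library calls only. The names vecsum_region and time_vecsum below are hypothetical, and the units of the reported value depend on how clock( ) is implemented in your run-time environment.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* The code region under test: a basic vector sum. */
void vecsum_region(short *sum, const short *in1, const short *in2,
                   unsigned int n)
{
    unsigned int i;
    for (i = 0; i < n; i++)
        sum[i] = in1[i] + in2[i];
}

/* Time one call of the region and print the elapsed clock ticks.
   Returns the elapsed tick count as reported by clock(). */
clock_t time_vecsum(short *sum, const short *in1, const short *in2,
                    unsigned int n)
{
    clock_t start, stop;

    assert(sum != NULL && in1 != NULL && in2 != NULL);

    start = clock();
    vecsum_region(sum, in1, in2, n);
    stop = clock();

    printf("vecsum: %ld clock ticks\n", (long)(stop - start));
    return stop - start;
}
```

Keeping the region in its own function, as the last bullet suggests, makes it easy to move the loop into a separate file and time it stand-alone.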
3.2 Compiling C Code


The ’C6x compiler offers high-level language support by transforming your C
code into more efficient assembly language source code. The compiler tools
include a shell program (cl6x), which you use to compile, assembly optimize,
assemble, and link programs in a single step. To invoke the compiler shell,
enter:


cl6x [options] [filenames] [-z [linker options] [object files]]


For a complete description of the C compiler and the options discussed in this
chapter, see the TMS320C6x Optimizing C Compiler User’s Guide.


3.2.1 Compiler Options 


Options control the operation of the compiler. Table 3—1 defines the options 
discussed in this chapter. 


Table 3-1. Subset of Compiler Options


Option    Description
-o†       Enables software pipelining and other optimizations in the
          compiler
-pm‡      Enables program-level optimization
-mt       Enables the compiler to use assumptions that allow it to
          be more aggressive with certain optimizations
-mg       Allows you to profile optimized code
          Ensures that redundant loops are not generated
-k        Keeps the assembly file so that you can inspect it
          Disables software pipelining

† Although -o3 is preferable, at a minimum use the -o option.
‡ Use the -pm option for as much of your program as possible.
3.2.2 Memory Dependencies


To maximize the efficiency of your code, the ’C6x compiler schedules as many
instructions as possible in parallel. To schedule instructions in parallel, the
compiler must determine the relationships, or dependencies, between
instructions. Dependency means that one instruction must occur before another.
Because only independent instructions can execute in parallel, dependencies
inhibit parallelism.


- If the compiler cannot determine that two instructions are independent (for
  example, b does not depend on a), it assumes a dependency and
  schedules the two instructions sequentially.

- If the compiler can determine that two instructions are independent of one
  another, it can schedule them in parallel.


Often it is difficult for the compiler to determine if instructions that access
memory are independent. The following techniques help the compiler
determine which instructions are independent:


- Use the const keyword to indicate which objects are not changed by a
  function.

- Use the -pm (program-level optimization) option, which gives the compiler
  global access to the whole program or module and allows it to be more
  aggressive in ruling out dependencies.

- Use the -mt option, which allows the compiler to use assumptions that
  allow it to eliminate dependencies.


To illustrate the concept of memory dependencies, it is helpful to look at the
algorithm code in a dependency graph. Example 3-1 shows the C code for a
basic vector sum. Figure 3-1 shows the dependency graph for this basic
vector sum. (For more information, see section 5.2.4, Drawing a Dependency
Graph, on page 5-5.)


Example 3-1. Basic Vector Sum


void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
Figure 3-1. Dependency Graph for Vector Sum #1


[Figure: dependency graph. Two Load nodes (5 cycles each) feed an Add elements node (1 cycle), which feeds a Store to memory node. The numbers on the paths give the number of cycles required to complete an instruction.]


The dependency graph in Figure 3-1 shows that:


- The paths from sum[i] back to in1[i] and in2[i] indicate that writing to sum
  may have an effect on the memory pointed to by either in1 or in2.

- A read from in1 or in2 cannot begin until the write to sum finishes, which
  creates an aliasing problem. Aliasing occurs when two pointers can point
  to the same memory location. For example, if vecsum( ) is called in a
  program with the following statements, in1 and sum alias each other because
  they both point to the same memory location:


    short a[10], b[10];
    vecsum(a, a, b, 10);
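A case where aliasing changes the result makes the dependency concrete. In the sketch below (alias_demo is a hypothetical name, and vecsum is the function from Example 3-1 with an unsigned loop counter), sum points one element past in1, so every write to sum[i] is read back as in1[i+1] on the next iteration. A schedule that issued all of the loads before the stores would compute different values, which is why the compiler must assume the dependency unless told otherwise.

```c
#include <assert.h>

/* Vector sum as in Example 3-1: no const, so sum may alias in1 or in2. */
void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    unsigned int i;
    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}

/* Overlapping call: sum = buf + 1 and in1 = buf.
   Iteration i writes buf[i+1], which iteration i+1 then reads. */
void alias_demo(short result[5])
{
    short buf[5]  = {1, 0, 0, 0, 0};
    short ones[4] = {1, 1, 1, 1};
    int i;

    vecsum(buf + 1, buf, ones, 4);
    for (i = 0; i < 5; i++)
        result[i] = buf[i];

    /* Sequential execution propagates each new value forward. */
    assert(result[4] == 5);
}
```

Executed in program order, buf ends up as {1, 2, 3, 4, 5}. If all four reads of in1 had been performed before any write, the result would instead be {1, 2, 1, 1, 1}.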


3.2.2.1 The const Keyword

In Figure 3-1, the reads from in1 and in2 finish before the write to sum within
a single iteration. However, the ’C6x compiler uses software pipelining to
execute multiple iterations in parallel and, therefore, must determine memory
dependencies that exist across loop iterations.


To help the compiler, you can qualify an object with the const keyword, which
indicates that a variable or the memory referenced by a variable will not be
changed, but will remain constant. It is good coding practice to use the const
keyword wherever you can, because it is a simple way to increase the
performance and robustness of your code.


Example 3-2 shows the vecsum( ) example rewritten with the const keyword
to indicate that the write to sum never changes the memory referenced by in1
and in2. Figure 3-2 shows the revised dependency graph for the code in the
inner loop.


Example 3-2. Vector Sum With const Keywords


void vecsum2(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


Figure 3-2. Dependency Graph for Vector Sum #2 
Example 3-3 shows the output of the compiler for the vector sum in
Example 3-2. The compiler finds better schedules when dependency paths
are eliminated between instructions. For this loop, the compiler found a
software pipeline with a 2-cycle kernel, compared with seven cycles for the
previous loop. (The kernel is the body of a pipelined loop where all instructions
execute in parallel.)

Example 3-3. Compiler Output for Vector Sum Code


L14:       ; PIPE LOOP KERNEL
           ADD     .L1X    B4,A0,A5
   [ B0]   B       .S2     L14
           LDH     .D1     *A3++,A0
           STH     .D1     A5,*A4++
   [ B0]   SUB     .L2     B0,1,B0
           LDH     .D2     *B5++,B4


For basic information on assembly code, see Chapter 4, Structure of Assem- 
bly Code. 


3.2.2.2 Performing Program-Level Optimization (-pm Option)


You can specify program-level optimization by using the -pm option with the
-o3 option. With program-level optimization, all your source files are compiled
into one intermediate file called a module. The module moves to the
optimization and code generation passes of the compiler. Because the compiler has
access to the entire program, it performs several optimizations that are rarely
applied during file-level optimization:


- If a particular argument in a function always has the same value, the
  compiler replaces the argument with the value and passes the value instead
  of the argument.

- If a return value of a function is never used, the compiler deletes the return
  code in the function.

- If a function is not called, directly or indirectly, the compiler removes the
  function.


3.2.2.3 The -mt Option


Another way to eliminate memory dependencies is to use the -mt option,
which allows the compiler to use assumptions that can eliminate memory
dependency paths. For example, if you use the -mt option when compiling the
code in Example 3-1, the compiler uses the assumption that in1 and in2
do not alias memory pointed to by sum and, therefore, eliminates memory
dependencies among the instructions that access those variables.
3.3 Refining C Code


You can realize substantial gains from the performance of your C code by
refining your code in the following areas:


- Using intrinsics to replace complicated C code
- Using word access to operate on 16-bit data stored in the high and low
  parts of a 32-bit register
- Software pipelining the instructions manually


3.3.1 Using Intrinsics


The ’C6x compiler provides intrinsics, special functions that map directly to
inlined ’C62xx instructions, to optimize your C code quickly. All instructions that
are not easily expressed in C code are supported as intrinsics. Intrinsics are
specified with a leading underscore (_) and are accessed by calling them as
you call a function.


For example, saturated addition can be expressed in C code only by writing 
a multicycle function, such as the one in Example 3-4. 


Example 3-4. Saturated Add Without Intrinsics 


int sadd(int a, int b)
{
    int result;

    result = a + b;
    if (((a ^ b) & 0x80000000) == 0)
    {
        if ((result ^ a) & 0x80000000)
        {
            result = (a < 0) ? 0x80000000 : 0x7fffffff;
        }
    }
    return (result);
}


This complicated code can be replaced by the _sadd() intrinsic, which results 
in a single ’C62xx instruction (see Example 3-5). 


Example 3-5. Saturated Add With Intrinsics 


result = _sadd(a,b); 
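To check the saturation behavior on a host machine, where the _sadd( ) intrinsic is not available, a portable model can be used. The following is a sketch under stated assumptions: sadd_model is a hypothetical name, and the addition is carried out on unsigned values because signed overflow is undefined in standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of a 32-bit saturated add.
   Overflow occurs only when both operands have the same sign and the
   wrapped sum has the opposite sign, as in Example 3-4. */
int32_t sadd_model(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t ur = ua + ub;              /* wraps modulo 2^32 */

    if (((ua ^ ub) & 0x80000000u) == 0 &&
        ((ur ^ ua) & 0x80000000u) != 0)
        return (a < 0) ? INT32_MIN : INT32_MAX;

    return a + b;                       /* no overflow: safe to add */
}
```

With this model, sadd_model(INT32_MAX, 1) returns INT32_MAX and sadd_model(INT32_MIN, -1) returns INT32_MIN, while in-range sums such as sadd_model(5, -7) are returned unchanged.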
Table 3-2 lists the ’C62xx intrinsics. For more information on using intrinsics,
see the TMS320C6x Optimizing C Compiler User’s Guide.


Table 3-2. TMS320C6x C Compiler Intrinsics


C Compiler Intrinsic                            Assembly Instruction
    Description

int _abs(int src2);                             ABS
long _labs(long src2);
    Returns the saturated absolute value of src2.

int _add2(int src1, int src2);                  ADD2
    Adds the upper and lower halves of src1 to the upper and lower halves
    of src2 and returns the result. Any overflow from the lower half add
    will not affect the upper half add.

uint _clr(uint src2, uint csta, uint cstb);     CLR
    Clears the specified field in src2. The beginning and ending bits of
    the field to be cleared are specified by csta and cstb, respectively.

int _ext(uint src2, uint csta, int cstb);       EXT
    Extracts the specified field in src2, sign-extended to 32 bits. The
    extract is performed by a shift left followed by a signed shift right;
    csta and cstb are the shift left and shift right amounts, respectively.

uint _extu(uint src2, uint csta, uint cstb);    EXTU
    Extracts the specified field in src2, zero-extended to 32 bits. The
    extract is performed by a shift left followed by an unsigned shift
    right; csta and cstb are the shift left and shift right amounts,
    respectively.

uint _lmbd(uint src1, uint src2);               LMBD
    Searches for a leftmost 1 or 0 of src2 determined by the LSB of src1.
    Returns the number of bits up to the bit change.

int _mpy(int src1, int src2);                   MPY
int _mpyus(uint src1, int src2);                MPYUS
int _mpysu(int src1, uint src2);                MPYSU
uint _mpyu(uint src1, uint src2);               MPYU
    Multiplies the 16 LSBs of src1 by the 16 LSBs of src2 and returns the
    result. Values can be signed or unsigned.

int _mpyh(int src1, int src2);                  MPYH
int _mpyhus(uint src1, int src2);               MPYHUS
int _mpyhsu(int src1, uint src2);               MPYHSU
uint _mpyhu(uint src1, uint src2);              MPYHU
    Multiplies the 16 MSBs of src1 by the 16 MSBs of src2 and returns the
    result. Values can be signed or unsigned.

int _mpyhl(int src1, int src2);                 MPYHL
int _mpyhuls(uint src1, int src2);              MPYHULS
int _mpyhslu(int src1, uint src2);              MPYHSLU
uint _mpyhlu(uint src1, uint src2);             MPYHLU
    Multiplies the 16 MSBs of src1 by the 16 LSBs of src2 and returns the
    result. Values can be signed or unsigned.
int _mpylh(int src1, int src2);                 MPYLH
int _mpyluhs(uint src1, int src2);              MPYLUHS
int _mpylshu(int src1, uint src2);              MPYLSHU
uint _mpylhu(uint src1, uint src2);             MPYLHU
    Multiplies the 16 LSBs of src1 by the 16 MSBs of src2 and returns the
    result. Values can be signed or unsigned.

void _nassert(int);
    Generates no code. Tells the optimizer that the expression declared
    with the assert function is true; this gives a hint to the optimizer
    as to what optimizations might be valid.

uint _norm(int src2);                           NORM
uint _lnorm(long src2);
    Returns the number of bits up to the first nonredundant sign bit of
    src2.

int _sadd(int src1, int src2);                  SADD
long _lsadd(int src1, long src2);
    Adds src1 to src2 and saturates the result. Returns the result.

int _sat(long src2);                            SAT
    Converts a 40-bit value to a 32-bit value and saturates if necessary.

uint _set(uint src2, uint csta, uint cstb);     SET
    Sets the specified field in src2 to all 1s and returns the src2 value.
    The beginning and ending bits of the field to be set are specified by
    csta and cstb, respectively.

int _smpy(int src1, int src2);                  SMPY
int _smpyh(int src1, int src2);                 SMPYH
int _smpyhl(int src1, int src2);                SMPYHL
int _smpylh(int src1, int src2);                SMPYLH
    Multiplies src1 by src2, left shifts the result by one, and returns
    the result. If the result is 0x8000 0000, saturates the result to
    0x7FFF FFFF.

uint _sshl(uint src2, uint src1);               SSHL
    Shifts src2 left by the contents of src1, saturates the result to
    32 bits, and returns the result.

int _ssub(int src1, int src2);                  SSUB
long _lssub(int src1, long src2);
    Subtracts src2 from src1, saturates the result size, and returns the
    result.

uint _subc(uint src1, uint src2);               SUBC
    Conditional subtract divide step.

int _sub2(int src1, int src2);                  SUB2
    Subtracts the upper and lower halves of src2 from the upper and lower
    halves of src1, and returns the result. Any borrowing from the lower
    half subtract does not affect the upper half subtract.
3.3.2 Using Word Access for Short Data


The ’C62xx has instructions with corresponding intrinsics, such as _add2( ),
_mpyhl( ), and _mpylh( ), that operate on 16-bit data stored in the high and low
parts of a 32-bit register. When operating on a stream of short data, you can
use word (int) accesses to read two short values at a time, and then use ’C62xx
intrinsics to operate on the data. For example, rewriting the vecsum( ) function
to use word accesses (as in Example 3-6) doubles the performance of the
loop. See section 5.3, Loading Two Data Values with LDW, on page 5-10 for
more information.


Example 3-6. Vector Sum With const Keywords, _nassert, Word Reads


void vecsum4(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;
    const int *i_in1 = (const int *)in1;
    const int *i_in2 = (const int *)in2;
    int *i_sum = (int *)sum;

    _nassert(N >= 20);

    for (i = 0; i < (N/2); i++)
        i_sum[i] = _add2(i_in1[i], i_in2[i]);
}


Note: 


The _nassert intrinsic tells the optimizer that the code that follows meets the 
condition specified. 


This transformation assumes that the pointers sum, in1, and in2 can be cast
to int *, which means that they must point to word-aligned data. By default, the
compiler aligns all short arrays on word boundaries; however, a call like the
following creates an illegal memory access:


short a[51], b[50], c[50];
vecsum4(&a[1], b, c, 50);


Another consideration is that the loop must now run for an even number of 
iterations. You can ensure that this happens by padding the short arrays so 
that the loop always operates on an even number of elements. 
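The word-access transformation and its even-count requirement can be modeled on a host machine. This is a sketch under stated assumptions, not ’C62xx code: add2_model and vecsum_words are hypothetical names, int16_t and uint32_t stand in for short and int, and memcpy( ) is used for the word reads and writes so that the model does not depend on pointer alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of _add2: two independent 16-bit adds in one 32-bit word. */
uint32_t add2_model(uint32_t a, uint32_t b)
{
    uint32_t hi = (a & 0xffff0000u) + (b & 0xffff0000u);
    uint32_t lo = (a + b) & 0x0000ffffu;
    return hi | lo;
}

/* Vector sum that processes two 16-bit elements per iteration.
   n must be even; an odd trailing element would be left untouched. */
void vecsum_words(int16_t *sum, const int16_t *in1, const int16_t *in2,
                  unsigned int n)
{
    unsigned int i;

    assert((n & 1u) == 0);              /* even element count required */
    for (i = 0; i < n / 2; i++) {
        uint32_t w1, w2, ws;
        memcpy(&w1, &in1[2 * i], sizeof w1);   /* read two shorts */
        memcpy(&w2, &in2[2 * i], sizeof w2);
        ws = add2_model(w1, w2);
        memcpy(&sum[2 * i], &ws, sizeof ws);   /* write two shorts */
    }
}
```

The loop runs n/2 times instead of n times, which is the source of the doubled throughput. Each 16-bit lane wraps independently, so 32767 + 1 in one lane yields -32768 without disturbing its neighbor.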




If a vecsum() function is needed to handle short-aligned data and odd-numbered
loop counters, then you must add code within the function to check for
these cases. Knowing what type of data is passed to a function can improve
performance considerably. It may be useful to write different functions that can
handle different types of data. If your short-data operations always operate on
even-numbered word-aligned arrays, then the performance of your application
can be improved. However, Example 3-7 provides a generic vecsum()
function that handles all types of data.


Example 3-7. Vector Sum With const Keywords, _nassert, Word Reads (Generic Version) 


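void vecsum5(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    _nassert(N >= 20);

    if (((int)sum | (int)in2 | (int)in1) & 0x2)
    {
        /* At least one pointer is not word-aligned: use halfword accesses. */
        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }
    else
    {
        const int *i_in1 = (const int *)in1;
        const int *i_in2 = (const int *)in2;
        int *i_sum = (int *)sum;

        for (i = 0; i < (N/2); i++)
            i_sum[i] = _add2(i_in1[i], i_in2[i]);

        /* When N is odd, the word loop covers only the first N-1 elements,
           so the last element (index N-1) is handled separately. */
        if (N & 0x1) sum[N - 1] = in1[N - 1] + in2[N - 1];
    }
}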


3.3.2.1 Using Word Access in Dot Product 


Other intrinsics that are useful for reading short data as words are the multiply
intrinsics. Example 3-8 is a dot product example that reads word-aligned short
data and uses the _mpy() and _mpyh() intrinsics. The _mpyh() intrinsic
uses the ’C62xx instruction MPYH, which multiplies the high 16 bits of two
registers, giving a 32-bit result.

This example also uses two sum variables (sum1 and sum2). Using only one
sum variable inhibits parallelism by creating a dependency between the write
from the first sum calculation and the read in the second sum calculation.
Within a small loop body, avoid writing to the same variable, because it inhibits
parallelism and creates dependencies.
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Example 3-8. Dot Product Using Intrinsics 


int dotprod(const short *a, const short *b, unsigned int N)
{
    int i, sum1 = 0, sum2 = 0;
    const int *i_a = (const int *)a;
    const int *i_b = (const int *)b;

    for (i = 0; i < (N >> 1); i++)
    {
        sum1 = sum1 + _mpy (i_a[i], i_b[i]);
        sum2 = sum2 + _mpyh(i_a[i], i_b[i]);
    }

    return sum1 + sum2;
}
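The same structure can be tried on a host machine with portable stand-ins for the multiply intrinsics. In this sketch, mpy_emul, mpyh_emul, pack2, and dotprod_packed are hypothetical names (not TI APIs). The words are built explicitly with pack2 rather than by casting short pointers, so the result does not depend on host byte order.

```c
#include <stdint.h>

/* Signed low and high halves of a 32-bit word. The conversion to int16_t
 * assumes the usual two's-complement wraparound. */
static int16_t lo16(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }
static int16_t hi16(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }

/* Hypothetical portable models of _mpy (low x low) and _mpyh (high x high). */
static int32_t mpy_emul(uint32_t a, uint32_t b)  { return (int32_t)lo16(a) * lo16(b); }
static int32_t mpyh_emul(uint32_t a, uint32_t b) { return (int32_t)hi16(a) * hi16(b); }

/* Packs two shorts into one word: 'even' in the low half, 'odd' in the high half. */
static uint32_t pack2(int16_t even, int16_t odd)
{
    return ((uint32_t)(uint16_t)odd << 16) | (uint16_t)even;
}

/* Dot product over n_words packed words, using two independent accumulators
 * so that the two multiply-accumulate chains do not depend on each other. */
static int32_t dotprod_packed(const uint32_t *a, const uint32_t *b, unsigned int n_words)
{
    int32_t sum1 = 0, sum2 = 0;
    unsigned int i;

    for (i = 0; i < n_words; i++) {
        sum1 += mpy_emul(a[i], b[i]);   /* products of the low halves */
        sum2 += mpyh_emul(a[i], b[i]);  /* products of the high halves */
    }
    return sum1 + sum2;
}
```

For the vectors (1, 2, -3, 4) and (5, 6, 7, -8), packed two per word, dotprod_packed returns 5 + 12 - 21 - 32 = -36.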


3.3.2.2 Using Word Access in FIR Filter 


Example 3-9 shows an FIR filter that can be optimized with word reads of short 
data and multiply intrinsics. 


Example 3-9. FIR Filter—Original Form 


void fir1(const short x[], const short h[], short y[], int n, int m, int s)
{
    int i, j;
    long y0;
    long round = 1L << (s - 1);

    for (j = 0; j < m; j++)
    {
        y0 = round;

        for (i = 0; i < n; i++)
            y0 += x[i + j] * h[i];

        y[j] = (int)(y0 >> s);
    }
}


Example 3-10 shows an optimized version of Example 3-9. The optimized 
version passes an int array instead of casting the short arrays to int arrays and, 
therefore, helps ensure that data passed to the function is word-aligned. As- 
suming that a prototype is used, each invocation of the function ensures that 
the input arrays are word-aligned by forcing you to insert a cast or by using int 
arrays that contain short data. 


Example 3-10. FIR Filter: Optimized Form

void fir2(const int x[], const int h[], short y[], int n, int m, int s)
{
    int i, j;
    long y0, y1;
    long round = 1L << (s - 1);

    _nassert(m >= 16);
    _nassert(n >= 16);

    for (j = 0; j < (m >> 1); j++)
    {
        y0 = y1 = round;
        for (i = 0; i < (n >> 1); i++)
        {
            y0 += _mpy  (x[i + j],     h[i]);
            y0 += _mpyh (x[i + j],     h[i]);
            y1 += _mpyhl(x[i + j],     h[i]);
            y1 += _mpylh(x[i + j + 1], h[i]);
        }
        *y++ = (int)(y0 >> s);
        *y++ = (int)(y1 >> s);
    }
}
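The pairing of the four multiplies in the optimized filter is easier to follow when it is checked against the scalar form. The sketch below is a host-side model, not TI code: the *_e functions are hypothetical stand-ins for the intrinsics, rounding and shifting are omitted, and it assumes that the even-indexed sample of each pair sits in the low half of the word. Each inner iteration then contributes two taps to an even output (y0) and two taps to the following odd output (y1).

```c
#include <stdint.h>

static int16_t lo16(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }
static int16_t hi16(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }

/* Hypothetical models of _mpy, _mpyh, _mpyhl, and _mpylh. */
static int32_t mpy_e  (uint32_t a, uint32_t b) { return (int32_t)lo16(a) * lo16(b); }
static int32_t mpyh_e (uint32_t a, uint32_t b) { return (int32_t)hi16(a) * hi16(b); }
static int32_t mpyhl_e(uint32_t a, uint32_t b) { return (int32_t)hi16(a) * lo16(b); }
static int32_t mpylh_e(uint32_t a, uint32_t b) { return (int32_t)lo16(a) * hi16(b); }

/* Packs two samples: 'even' in the low half, 'odd' in the high half. */
static uint32_t pack2(int16_t even, int16_t odd)
{
    return ((uint32_t)(uint16_t)odd << 16) | (uint16_t)even;
}

/* Packed form: n taps and m outputs, both even. x must hold (n + m) / 2 words. */
static void fir_packed(const uint32_t *x, const uint32_t *h, int32_t *y, int n, int m)
{
    int i, j;
    for (j = 0; j < (m >> 1); j++) {
        int32_t y0 = 0, y1 = 0;
        for (i = 0; i < (n >> 1); i++) {
            y0 += mpy_e  (x[i + j],     h[i]);  /* x[even]   * h[even] */
            y0 += mpyh_e (x[i + j],     h[i]);  /* x[odd]    * h[odd]  */
            y1 += mpyhl_e(x[i + j],     h[i]);  /* x[odd]    * h[even] */
            y1 += mpylh_e(x[i + j + 1], h[i]);  /* x[even+2] * h[odd]  */
        }
        y[2 * j]     = y0;
        y[2 * j + 1] = y1;
    }
}

/* Scalar reference with the same (unrounded) arithmetic. */
static void fir_scalar(const int16_t *x, const int16_t *h, int32_t *y, int n, int m)
{
    int i, j;
    for (j = 0; j < m; j++) {
        int32_t acc = 0;
        for (i = 0; i < n; i++)
            acc += (int32_t)x[i + j] * h[i];
        y[j] = acc;
    }
}
```

Running both versions on the same samples and comparing the outputs element by element is a quick way to confirm that the high/low pairing has been written correctly.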
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3.3.3 Software Pipelining 


Software pipelining is a technique used to schedule instructions from a loop
so that multiple iterations of the loop execute in parallel. When you use the -o2
and -o3 compiler options, the compiler attempts to software pipeline your
code with information that it gathers from your program.


Figure 3-2 illustrates a software-pipelined loop. The stages of the loop are 
represented by A, B, C, D, and E. In this figure, a maximum of five iterations 
of the loop can execute at one time. The shaded area represents the loop ker- 
nel. In the loop kernel, all five stages execute in parallel. The area immediately 
before the kernel is known as the pipelined-loop prolog, and the area immedi- 
ately following the kernel is known as the pipelined-loop epilog. 


Figure 3-3. Software-Pipelined Loop 


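[Figure: successive loop iterations, each offset by one stage, executing stages
A through E. Three regions are labeled from top to bottom: Pipelined-loop
prolog, Kernel (shaded), and Pipelined-loop epilog.]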
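The prolog, kernel, and epilog can be modeled with a few lines of C. This is a sketch under one stated assumption: a new iteration starts every cycle, as in the five-stage figure described above. The function names are hypothetical.

```c
/* Stages are numbered 0..n_stages-1 (A = 0, B = 1, ...). Returns the stage
 * that iteration 'iter' occupies at 'cycle', or -1 if that iteration is not
 * executing in that cycle. Assumes one new iteration starts per cycle. */
static int stage_at(int cycle, int iter, int n_stages)
{
    int stage = cycle - iter;
    return (stage >= 0 && stage < n_stages) ? stage : -1;
}

/* Number of iterations executing at 'cycle' for a loop that runs
 * 'trip_count' iterations. */
static int iterations_in_flight(int cycle, int trip_count, int n_stages)
{
    int iter, count = 0;

    for (iter = 0; iter < trip_count; iter++)
        if (stage_at(cycle, iter, n_stages) >= 0)
            count++;
    return count;
}
```

With five stages and a trip count of 8, the count rises from 1 to 4 during cycles 0 through 3 (prolog), stays at 5 during cycles 4 through 7 (kernel), and falls back to 1 by cycle 11 (epilog).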


Because loops present critical performance areas in your code, consider the 
following areas to improve the performance of your C code: 


- Trip count

- Redundant loops

- Loop unrolling




3.3.3.1 Trip Count Issues 


A trip count is the number of times that a loop executes; the trip counter is the 
variable used to count each iteration. When the trip counter reaches a limit 
equal to the trip count, the loop terminates. The structure of a software pipeline 
requires the execution of a minimum number of loop iterations (a minimum trip 
count) in order to fill, or prime, the pipeline. 


Loops that are eligible for software pipelining have loop trip counters that count 
down. In most cases, the compiler can transform the loop to use a trip counter 
that counts down even if the original code was not written that way. 


For example, the optimizer transforms the loop in Example 3-11(a) to something
like the code in Example 3-11(b).

Example 3-11. Trip Counters
(a) Original code

for (i = 0; i < N; i++)    /* i = trip counter, N = trip count */

(b) Optimized code

for (i = N; i != 0; i--)   /* Downcounting trip counter */
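A downcounting loop can still walk its data in ascending order, because the trip counter only counts iterations. The following sketch (vecsum_down is a hypothetical name) writes the vector sum this way by hand; it illustrates the kind of loop the optimizer aims for and is not the compiler's actual output.

```c
/* Vector sum with a downcounting trip counter. The counter i only counts
 * iterations; the data is addressed through post-incremented pointers. */
static void vecsum_down(short *sum, const short *in1, const short *in2, unsigned int n)
{
    unsigned int i;

    for (i = n; i != 0; i--)
        *sum++ = (short)(*in1++ + *in2++);
}
```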


The minimum trip count for a software pipeline is determined by the number 
of iterations executing in parallel. 


If the compiler knows the trip count, it can generate faster and more compact 
code. If the compiler cannot determine that a loop always executes for the 
minimum trip count, it generates a redundant unpipelined loop. The redundant 
unpipelined loop is executed only when the runtime trip count is less than the 
minimum trip count; otherwise, the software-pipelined version of the loop is 
executed. 
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3.3.3.2 Eliminating Redundant Loops 


In Example 3-2 on page 3-7, the compiler cannot determine if the loop
always executes for at least the minimum trip count. Therefore, it generates
two versions of the loop:

- An unpipelined version that executes if N is less than the minimum trip
  count

- A software-pipelined version that executes if N is equal to or greater than
  the minimum trip count

To indicate to the compiler that you do not want two versions of the loop, you
can use the -ms option so that the compiler generates only the software-pipelined
code and never generates a redundant loop; however, loops with an
unknown trip count are not software pipelined.


3.3.3.3 Communicating Trip-Count Information to the Compiler

When invoking the compiler, use the following options to communicate trip-count
information to the compiler:

- Use the -o3 and -pm compiler options to allow the optimizer to access the
  whole program or large parts of it and to characterize the behavior of loop
  trip counts.

- Use the _nassert intrinsic to help reduce code size by preventing the
  generation of a redundant loop or by allowing the compiler (with or without
  the -ms option) to software pipeline innermost loops.


Example 3-12 shows the vector sum code with an _nassert intrinsic that 
asserts that N is always at least 10. 


Example 3-12. Vector Sum With const Keywords and _nassert 


void vecsum3(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    _nassert(N >= 10);

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


See the TMS320C6x Optimizing C Compiler User’s Guide for a complete
discussion of the -ms, -o3, and -pm options and the _nassert intrinsic.
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3.3.3.4 Loop Unrolling 


Another technique that improves performance is unrolling the loop; that is, ex- 
panding small loops so that each iteration of the loop appears in your code. 
This optimization increases the number of instructions available to execute in 
parallel. You can use loop unrolling when the operations in a single iteration 
do not use all of the resources of the ’C62xx architecture. 


In Example 3-13, the loop produces a new sum[i] every two cycles. Three
memory operations are performed: a load for both in1[i] and in2[i] and a store
for sum[i]. Because only two memory operations can execute per cycle, two
cycles are necessary to perform three memory operations.


Example 3-13. Vector Sum With Three Memory Operations 


void vecsum2(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


The performance of a software pipeline is limited by the number of resources 
that can execute in parallel. In its word-aligned form (Example 3-14), the vec- 
tor sum loop delivers two results every two cycles because the two loads and 
the store are all operating on two 16-bit values at a time. 


Example 3-14. Word-Aligned Vector Sum 


void vecsum4(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;
    const int *i_in1 = (const int *)in1;
    const int *i_in2 = (const int *)in2;
    int *i_sum = (int *)sum;

    _nassert(N >= 20);

    for (i = 0; i < (N/2); i++)
        i_sum[i] = _add2(i_in1[i], i_in2[i]);
}




If you unroll the loop once, the loop then performs six memory operations per 
iteration, which means the unrolled vector sum loop can deliver four results 
every three cycles (that is, 1.33 results per cycle). Example 3-15 shows four 
results for each iteration of the loop: sum[i] and sum[i+sz] each store an int 
value that represents two 16-bit values. 
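The throughput figures quoted for the three versions of the vector sum follow from one bound: at most two memory operations can execute per cycle. The helper names below are hypothetical, and the model deliberately considers memory operations only, as the discussion above does.

```c
/* Minimum cycles per loop iteration when at most 'ports' memory operations
 * can execute per cycle: ceil(mem_ops / ports). */
static unsigned int min_cycles(unsigned int mem_ops, unsigned int ports)
{
    return (mem_ops + ports - 1u) / ports;
}

/* Results per cycle, scaled by 100 to stay in integer arithmetic,
 * for a machine with two memory ports. */
static unsigned int results_per_cycle_x100(unsigned int results, unsigned int mem_ops)
{
    return (results * 100u) / min_cycles(mem_ops, 2u);
}
```

The halfword loop (1 result, 3 memory operations) gives 50, the word-aligned loop (2 results, 3 memory operations) gives 100, and the unrolled loop (4 results, 6 memory operations) gives 133, that is, 0.5, 1.0, and 1.33 results per cycle.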


Example 3-15 is not simple loop unrolling, in which the loop body is simply
replicated. The additional instructions use memory pointers that are offset to
point midway into the input arrays, and the code assumes that the arrays are
a multiple of four shorts in size.


Example 3-15. Vector Sum Using const Keywords, _nassert, Word Reads, and 
Loop Unrolling 


void vecsum6(int *sum, const int *in1, const int *in2, unsigned int N)
{
    int i;
    int sz = N >> 2;

    _nassert(N >= 20);

    for (i = 0; i < sz; i++)
    {
        sum[i]    = _add2(in1[i], in2[i]);
        sum[i+sz] = _add2(in1[i+sz], in2[i+sz]);
    }
}
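The index arithmetic in this style of unrolling can be checked on a host with a portable model. In the sketch below, add2_emul and vecsum_split are hypothetical names; n counts shorts, so the arrays hold n/2 words, and sz = n/4 iterations each produce word i and word i + sz. The sketch assumes n is a multiple of 4.

```c
#include <stdint.h>

/* Hypothetical portable model of _add2 (not a TI API): adds the 16-bit
 * halves independently, with no carry between them. */
static uint32_t add2_emul(uint32_t a, uint32_t b)
{
    return (((a >> 16) + (b >> 16)) << 16) | ((a + b) & 0xFFFFu);
}

/* Same loop structure as vecsum6: each iteration handles one word from the
 * first half of the arrays and one word from the second half. */
static void vecsum_split(uint32_t *sum, const uint32_t *in1, const uint32_t *in2,
                         unsigned int n)
{
    unsigned int i, sz = n >> 2;

    for (i = 0; i < sz; i++) {
        sum[i]      = add2_emul(in1[i],      in2[i]);       /* first half  */
        sum[i + sz] = add2_emul(in1[i + sz], in2[i + sz]);  /* second half */
    }
}
```

With n = 8 the loop runs twice and writes words 0 and 2, then words 1 and 3, so all four words are covered exactly once.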


Software pipelining is performed by the compiler only on inner loops; there- 
fore, you can increase performance by creating larger inner loops. One 
method for creating large inner loops is to completely unroll inner loops that 
execute for a small number of cycles. 


In Example 3-16, the compiler pipelines the inner loop with a kernel size of one 
cycle; therefore, the inner loop completes a result every cycle. However, the 
overhead of filling and draining the software pipeline can be significant, and 
other outer-loop code is not software pipelined. 
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Example 3-16. FIR_Type2— Original Form 


void fir2(const short input[], const short coefs[], short out[])
{
    int i, j;
    int sum;

    for (i = 0; i < 40; i++)
    {
        sum = 0;    /* restart the accumulation for each output sample */

        for (j = 0; j < 16; j++)
            sum += coefs[j] * input[i + 15 - j];

        out[i] = (sum >> 15);
    }
}


For loops with a simple loop structure, the compiler uses a heuristic to deter- 
mine if it should unroll the loop. Because unrolling can increase code size, in 
some cases the compiler does not unroll the loop. If you have identified this 
loop as being critical to your application, then unroll the inner loop in C code, 
as in Example 3-17. 
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Example 3-17. FIR_Type2—Inner Loop Completely Unrolled 


void fir2_u(const short input[], const short coefs[], short out[])
{
    int i;
    int sum;

    for (i = 0; i < 40; i++)
    {
        sum  = coefs[0]  * input[i + 15];
        sum += coefs[1]  * input[i + 14];
        sum += coefs[2]  * input[i + 13];
        sum += coefs[3]  * input[i + 12];
        sum += coefs[4]  * input[i + 11];
        sum += coefs[5]  * input[i + 10];
        sum += coefs[6]  * input[i + 9];
        sum += coefs[7]  * input[i + 8];
        sum += coefs[8]  * input[i + 7];
        sum += coefs[9]  * input[i + 6];
        sum += coefs[10] * input[i + 5];
        sum += coefs[11] * input[i + 4];
        sum += coefs[12] * input[i + 3];
        sum += coefs[13] * input[i + 2];
        sum += coefs[14] * input[i + 1];
        sum += coefs[15] * input[i + 0];
        out[i] = (sum >> 15);
    }
}


Now the outer loop is software-pipelined, and the overhead of draining and 
filling the software pipeline occurs only once per invocation of the function 
rather than for each iteration of the outer loop. 


3.3.3.5 What Disqualifies a Loop from Being Software-Pipelined 
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In a sequence of nested loops, the innermost loop is the only one that can be 
software-pipelined. The following restrictions apply to the software pipelining 
of loops: 


- Although a software-pipelined loop can contain intrinsics, it cannot contain
  function calls.

- You must not have a conditional break (early exit) in the loop.

- The loop cannot have an incrementing loop counter. One reason that you
  run the optimizer with the -o2 or -o3 option is to convert as many loops
  as possible into downcounting loops.

- If the trip counter is modified within the body of the loop, it typically cannot
  be converted into a downcounting loop. For example, the following code
  is not software-pipelined:

  for (i = 0; i < n; i++)
  {
      . . .
      i += x;
  }

- A conditionally incremented loop control variable is not software-pipelined.
  For example, the following code is not software-pipelined:

  for (i = 0; i < x; i++)
  {
      . . .
      if (b > a)
          i += 2;
  }

- If the code size is too large and requires more than the 32 registers in the
  ’C62xx, it is not software-pipelined.

- If a register value is live too long, the code is not software-pipelined. See
  section 5.5.6.2, Live Too Long, on page 5-40 and section 5.9, Live-Too-Long
  Issues, on page 5-74 for examples of code that is live too long.

- If the loop has complex condition code within the body that requires more
  than the five ’C62xx condition registers, the loop is not software pipelined.
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Chapter 4 


Structure of Assembly Code 


An assembly language program must be an ASCII text file. Any line of 
assembly code can include up to six items: 


- Label
- Conditions
- Instruction
- Functional unit
- Operands
- Comment

Topic

4.1  Labels
4.2  Parallel Bars
4.3  Conditions
4.4  Instructions
4.5  Functional Units
4.6  Operands
4.7  Comments
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4.1 Labels 


A label identifies a line of code or a variable and represents a memory address 
that contains either an instruction or data. 


Figure 4—1 shows the position of the label in a line of assembly code. The colon 
following the label is optional. 


Figure 4-1. Labels in Assembly Code

label: parallel bars [condition] instruction unit operands ; comments

Labels must meet the following conditions:

- The first character of a label must be a letter.
- The first character of the label must be in the first column of the text file.
- Labels can include up to 32 alphanumeric characters.

4.2 Parallel Bars

Figure 4-2. Parallel Bars in Assembly Code

label: parallel bars [condition] instruction unit operands ; comments

An instruction that executes in parallel with the previous instruction signifies
this with parallel bars (||). This field is left blank for an instruction that does not
execute in parallel with the previous instruction.


4.3 Conditions 




Five registers in the ’C62xx are available for conditions: A1, A2, B0, B1, and
B2. Figure 4-3 shows the position of a condition in a line of assembly code.

Figure 4-3. Conditions in Assembly Code

label: parallel bars [condition] instruction unit operands ; comments

All ’C62xx instructions are conditional:

- If no condition is specified, the instruction is always performed.

- If a condition is specified and that condition is true, the instruction
  executes. For example:

  With this condition ...    The instruction executes if ...
  [A1]                       A1 != 0
  [!A1]                      A1 = 0

- If a condition is specified and that condition is false, the instruction does
  not execute.

  With this condition ...    The instruction does not execute if ...
  [A1]                       A1 = 0
  [!A1]                      A1 != 0
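In C terms, a condition behaves like an if statement wrapped around a single instruction. The sketch below models a conditional ADD; cond_add is a hypothetical name, and the model covers only the [reg] and [!reg] forms shown in the tables above.

```c
/* Model of a conditional ADD. 'creg' is the value of the condition register
 * and 'negate' is nonzero for the [!reg] form. Returns the new value of the
 * destination: src1 + src2 if the instruction executes, otherwise the old
 * destination value, unchanged. */
static int cond_add(int creg, int negate, int src1, int src2, int dst_old)
{
    int execute = negate ? (creg == 0) : (creg != 0);
    return execute ? src1 + src2 : dst_old;
}
```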
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4.4 Instructions 


Assembly code instructions are either directives or mnemonics: 


- Assembler directives are commands for the assembler (asm6x) that
  control the assembly process or define the data structures (constants and
  variables) in the assembly language program. All assembler directives
  begin with a period, as shown in the partial list in Table 4-1.

- Processor mnemonics are the actual microprocessor instructions that
  execute at runtime and perform the operations in the program. Table 4-2
  summarizes the ’C62xx mnemonics. Processor mnemonics must begin in
  column 2 or greater.


Figure 4—4 shows the position of the instruction in a line of assembly code. 


Figure 4—4. Instructions in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


Table 4-1. Selected TMS320C62xx Directives 


Directives        Description

.sect "name"      Creates section of information (data or code)

.int value        Reserve 32 bits in memory and fill with specified value
.long value
.word value

.short value      Reserve 16 bits in memory and fill with specified value
.half value

.byte value       Reserve 8 bits in memory and fill with specified value

See the TMS320C6x Assembly Language Tools User’s Guide for a complete 
list of directives. 
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Table 4-2. Selected TMS320C62xx Instruction Mnemonics 


Arithmetic  Multiply  Load/Store  Program Control  Bit Management  Logical  Pseudo/Other

ABS         MPY       LD          B                CLR             AND
ADD         MPYH      MVK         B IRP            EXT             CMPEQ
ADDA        MPYHL     MVKH        B NRP            LMBD            CMPGT
ADDK        MPYLH     ST                           NORM            CMPLT
ADD2        SMPY                                   SET             OR
SADD                                                               SHL
SAT                                                                SHR
SSUB                                                               SSHL
SUB                                                                XOR
SUBA
SUBC
SUB2


See the TMS320C62xx CPU and Instruction Set Reference Guide for a complete
list of instructions.
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4.5 Functional Units 


The ’C62xx CPU contains eight functional units, which are shown in 
Figure 4—5 and described in Table 4—3. 


Figure 4-5. TMS320C62xx Functional Units 


[Figure: register file A connects to functional units .L1, .S1, .M1, and .D1;
register file B connects to functional units .L2, .S2, .M2, and .D2; memory is
shown below the .D1 and .D2 units.]


Table 4—3. Functional Units and Descriptions 


Functional Unit        Description

.L unit (.L1, .L2)     32/40-bit arithmetic and compare operations
                       Leftmost 1, 0, bit counting for 32 bits
                       Normalization count for 32 and 40 bits
                       32-bit logical operations

.S unit (.S1, .S2)     32-bit arithmetic operations
                       32/40-bit shifts and 32-bit bit-field operations
                       32-bit logical operations
                       Branching
                       Constant generation
                       Register transfers to/from the control register file

.M unit (.M1, .M2)     16 x 16 bit multiplies

.D unit (.D1, .D2)     32-bit add, subtract, linear and circular address calculation


Figure 4-6 shows the position of the unit in a line of assembly code. 
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Figure 4—6. Units in the Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments


Specifying the functional unit in the assembly code is optional. The functional 
unit can be used to document which resource(s) each instruction uses. 
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4.6 Operands 


The ’C62xx architecture requires that memory reads and writes move data 
between memory and a register. Figure 4—7 shows the position of the oper- 
ands in a line of assembly code. 


Figure 4—7. Operands in the Assembly Code 


label: parallel bars [condition] instruction unit operands  ; comments 


Instructions have the following requirements for operands in the assembly 
code: 


- All instructions require a destination operand.

- Most instructions require one or two source operands.

- The destination operand must be in the same register file as one source
  operand.

- One source operand per execute packet can come from the register file
  opposite that of the other source operand.


When an operand comes from the other register file, the unit includes an X, 
as shown in Figure 4—8, indicating that the instruction is using one of the 
cross paths. 


Figure 4—8. Operands in Instructions 


ADD .L1  A0,A1,A3

ADD .L1X A0,B1,A3

All registers except B1 are on the same side of the CPU.
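The register-file rule can be expressed as a small check. This is a sketch with hypothetical names (reg_side, needs_cross_path); it handles register operands only, written as strings such as "A0" or "B1", and ignores constant operands.

```c
/* Returns 'A' or 'B' for a register name such as "A0" or "B1", or 0 if the
 * name does not start with A or B. */
static char reg_side(const char *reg)
{
    return (reg[0] == 'A' || reg[0] == 'B') ? reg[0] : 0;
}

/* Returns 1 if exactly one source is in the register file opposite the
 * destination (the unit needs the X suffix), 0 if both sources are on the
 * destination's side, and -1 if both sources are on the opposite side, which
 * violates the rule that the destination shares a file with one source. */
static int needs_cross_path(const char *dst, const char *src1, const char *src2)
{
    int other = (reg_side(src1) != reg_side(dst)) + (reg_side(src2) != reg_side(dst));

    if (other == 0) return 0;
    if (other == 1) return 1;
    return -1;
}
```

Applied to the two instructions in Figure 4-8, the check returns 0 for A0,A1,A3 and 1 for A0,B1,A3, matching the .L1 and .L1X unit names.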


The ’C62xx instructions use three types of operands to access data:

- Register operands indicate a register that contains the data.

- Constant operands specify the data within the assembly code.

- Pointer operands contain addresses of data values.

Only the load and store instructions require and use pointer operands to
move data values between memory and a register.




4.7 Comments 


As with all programming languages, comments provide code documentation. 
Figure 4—9 shows the position of the comment in a line of assembly code. 


Figure 4-9. Comments in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments

The following are guidelines for using comments in assembly code:

- A comment may begin in any column when preceded by a semicolon (;).

- A comment must begin in the first column when preceded by an asterisk (*).

- Comments are not required but are recommended.
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Chapter 5 


Optimizing Assembly Code 


This chapter describes methods that help you develop more efficient 
assembly language programs, understand the code produced by the 
assembly optimizer, and perform manual optimization. 


This chapter encompasses phase 3 of the code development flow. After you 
have developed and optimized your C code using the 'C6x compiler, extract 
the inefficient areas from your C code and rewrite them in linear assembly (as- 
sembly code that has not been register-allocated and is unscheduled). 


The assembly code shown in this chapter has been hand-optimized in order 
to direct your attention to particular coding issues. The actual output from the 
assembly optimizer may look different, depending on the version you are us- 
ing. 


Topic

5.1   Assembly Code
5.2   Writing Parallel Code
5.3   Using Word Access for Short Data
5.4   Software Pipelining
5.5   Modulo Scheduling of Multicycle Loops
5.6   Loop Carry Paths
5.7   If-Then-Else Statements in a Loop
5.8   Loop Unrolling
5.9   Live-Too-Long Issues
5.10  Redundant Load Elimination
5.11  Memory Banks
5.12  Software Pipelining the Outer Loop
5.13  Outer Loop Conditionally Executed With Inner Loop




5.1 Assembly Code 
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The source that you write for the assembly optimizer is similar to assembly 
source code; however, linear assembly does not include information about 
parallel instructions, instruction latencies, or register usage. The assembly op- 
timizer takes care of the difficulties of streamlining your code by: 


- Finding instructions that can be executed in parallel
- Handling pipeline latencies during software pipelining
- Assigning register usage
- Defining which unit to use


Although you have the option with the 'C6x to specify the functional unit or reg- 
ister used, this may restrict the compiler’s ability to fully optimize your code. 
See the TMS320C6x Optimizing C Compiler User’s Guide for more informa- 
tion. 


This chapter takes you through the optimization process manually to show you 
how the assembly optimizer works and to help you understand when you might 
want to perform some of the optimizations manually. Each section introduces 
optimization techniques in increasing complexity: 


- Section 5.2 and section 5.3 begin with a dot product algorithm to show you
  how to translate the C code to assembly code and then how to optimize
  the assembly code with several simple techniques.

- Section 5.4 and section 5.5 introduce techniques, such as modulo iteration
  interval scheduling for both single-cycle loops and multicycle loops,
  for the more complex algorithms associated with software pipelining.

- Section 5.6 uses an IIR filter algorithm to discuss the problems with loop
  carry paths.

- Section 5.7 and section 5.8 discuss the problems encountered with
  if-then-else statements in a loop and how loop unrolling can be used to
  resolve them.

- Section 5.9 introduces live-too-long issues in your code.

- Section 5.10 uses a simple FIR filter algorithm to discuss redundant load
  elimination.

- Section 5.11 discusses the same FIR filter in terms of the interleaved
  memory bank scheme used by ’C62xx devices.

- Section 5.12 and section 5.13 show you how to execute the outer loop of
  the FIR filter conditionally and in parallel with the inner loop.




Each example discusses the: 


- Algorithm in C code
- Translation of the C code to linear assembly
- Dependency graph to describe the flow of data in the algorithm
- Allocation of resources (functional units, registers, and cross paths)

Note:

There are three types of code for the ’C62xx: C code (which is input for the
C compiler), linear assembly code (which is input for the assembly optimizer),
and assembly code (which is input for the assembler).
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5.2 Writing Parallel Code 


One way to optimize assembly code is to reduce the number of execution 
cycles in a loop. You can do this by rewriting linear assembly instructions so 
that they execute in parallel. 


5.2.1. Dot Product C Code 


The C code in Example 5-1 is a dot product algorithm. The dot product is a
sum in which each element in array a is multiplied by the corresponding element
in array b. Each of these products is then accumulated into sum.


Example 5—1. Dot Product C Code 


int dotp(short a[], short b[])
{
    int sum, i;
    sum = 0;

    for (i = 0; i < 100; i++)
        sum += a[i] * b[i];

    return (sum);
}


5.2.2 Translating C Code to Linear Assembly 


Example 5—2 shows the linear assembly instructions used for the inner loop 
of the dot product C code. 


Example 5—2. List of Assembly Instructions for Dot Product 


        LDH   .D1   *A4++,A2     ; load ai from memory
        LDH   .D1   *A3++,A5     ; load bi from memory
        MPY   .M1   A2,A5,A6     ; ai * bi
        ADD   .L1   A6,A7,A7     ; sum += (ai * bi)
        SUB   .S1   A1,1,A1      ; decrement loop counter
  [A1]  B     .S1   LOOP         ; branch to loop


The load halfword (LDH) instructions increment through the a and b arrays. 
Each LDH does a postincrement on the pointer. Each iteration of these instruc- 
tions sets the pointer to the next halfword (16 bits) in the array. The ADD in- 
struction accumulates the total of the results from the multiply (MPY) instruc- 
tion. The subtract (SUB) instruction decrements the loop counter. 


An additional instruction is included to execute the branch back to the top of 
the loop. The B (branch) instruction is conditional on the loop counter, A1, and 
executes only until A1 is 0. 
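Read back into C, the six instructions correspond to one statement each. The sketch below (dotp_steps is a hypothetical name) mirrors the linear assembly line by line; the register names in the comments refer to Example 5-2. It assumes count is at least 1, because the branch condition is tested after the loop body.

```c
/* One C statement per linear assembly instruction of the dot product loop. */
static int dotp_steps(const short *a, const short *b, int count)
{
    int sum = 0;               /* A7: accumulator               */
    int cntr = count;          /* A1: loop counter              */

    do {
        int ai = *a++;         /* LDH  *A4++,A2                 */
        int bi = *b++;         /* LDH  *A3++,A5                 */
        int pi = ai * bi;      /* MPY  A2,A5,A6                 */
        sum += pi;             /* ADD  A6,A7,A7                 */
        cntr -= 1;             /* SUB  A1,1,A1                  */
    } while (cntr != 0);       /* [A1] B LOOP: taken while A1 is nonzero */

    return sum;
}
```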
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5.2.3 Allocating Resources 


The following rules affect the assignment of functional units for Example 5-2 
(shown in the third column): 


- Load (LDH) instructions must use a .D unit.
- Multiply (MPY) instructions must use a .M unit.
- Add (ADD) instructions use a .L unit.
- Subtract (SUB) instructions use a .S unit.
- Branch (B) instructions must use a .S unit.


Note: 


The ADD and SUB can be on the .S, .L, or .D units; however, for 
Example 5-2, they are assigned as listed above. 


5.2.4 Drawing a Dependency Graph 


Dependency graphs can help analyze loops by showing the flow of instruc- 
tions and data in an algorithm. These graphs also show how instructions 
depend on one another. The following terms are used in defining a depen- 
dency graph. 


- A node is a point on a dependency graph with one or more data paths
  flowing in and/or out.

- The path shows the flow of data between nodes. The numbers beside
  each path represent the number of cycles required to complete the
  instruction.

- An instruction that writes to a variable is referred to as a parent instruction
  and defines a parent node.

- An instruction that reads a variable written by a parent instruction is
  referred to as its child and defines a child node.

Use the following steps to draw a dependency graph:

1. Define the nodes based on the variables accessed by the instructions.
2. Define the data paths that show the flow of data between nodes.
3. Add the instructions and the latencies.
4. Add the functional units.

Figure 5—1 shows the dependency graph for the dot product assembly instruc- 
tions (shown in Example 5—2) and their corresponding register allocations. 
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Figure 5—1. Dependency Graph of Dot Product 


Instruction 
. —p LDH LDH 
mnemonic Functional 
unit 
Variable 
being 


written 


5 Register SUB 
y 4 allocation 
Number of cycles { St 
required to complete M1 7 
an instruction ae 
{ 

- The two LDH instructions, which write the values of ai and bi, are parents
  of the MPY instruction. It takes five cycles for the parent (LDH) instruction
  to complete. Therefore, if LDH is scheduled on cycle i, then its child (MPY)
  cannot be scheduled until cycle i + 5.

- The MPY instruction, which writes the product pi, is the parent of the ADD
  instruction. The MPY instruction takes two cycles to complete.

- The ADD instruction adds pi (the result of the MPY) to sum. The output of
  the ADD instruction feeds back to become an input on the next iteration
  and, thus, creates a loop carry path. (See section 5.6 on page 5-50 for
  more information on loop carry paths.)


The dependency graph for this dot product algorithm has two separate parts
because the decrement of the loop counter and the branch do not read or write
any variables from the other part.


- The SUB instruction writes to the loop counter, cntr. The output of the SUB
  instruction feeds back and creates a loop carry path.

- The B (branch) instruction is a child of the loop counter.
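
The parent/child timing rule described above can be sketched in C. This is only an illustration under stated assumptions, not part of the TI tools: the Edge type, the node numbering, and earliest_issue() are hypothetical names, the latencies (5 for LDH, 2 for MPY) are the ones quoted in the text, and the loop carry paths are left out so that the graph stays acyclic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node numbering for the dot product graph. */
enum { LDH_A, LDH_B, MPY, ADD, NUM_NODES };

/* A data path: the child may issue `latency` cycles after the parent. */
typedef struct {
    int parent;
    int child;
    int latency;
} Edge;

/* Computes the earliest issue cycle of every node. Assumes the edges are
 * listed in topological order (all edges into a node appear before any
 * edge out of it). Nodes without parents issue on cycle 0. */
void earliest_issue(const Edge *edges, size_t num_edges,
                    int *cycle, size_t num_nodes)
{
    for (size_t n = 0; n < num_nodes; n++)
        cycle[n] = 0;
    for (size_t e = 0; e < num_edges; e++) {
        assert((size_t)edges[e].parent < num_nodes);
        assert((size_t)edges[e].child < num_nodes);
        int ready = cycle[edges[e].parent] + edges[e].latency;
        if (ready > cycle[edges[e].child])
            cycle[edges[e].child] = ready;
    }
}

/* Data paths of the dot product graph, without the loop carry path. */
const Edge dotp_edges[] = {
    { LDH_A, MPY, 5 },   /* ai reaches MPY five cycles after LDH */
    { LDH_B, MPY, 5 },   /* bi reaches MPY five cycles after LDH */
    { MPY,   ADD, 2 },   /* pi reaches ADD two cycles after MPY  */
};
```

Calling earliest_issue(dotp_edges, 3, cycle, NUM_NODES) places the LDH instructions on cycle 0, the MPY on cycle 5, and the ADD on cycle 7, which is the serial ordering that the following examples start from.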

5.2.5 Comparing Performance (Nonparallel Versus Parallel Assembly Code)


Example 5-3 shows the nonparallel (or linear) assembly code for the dot prod- 
uct loop. The MVK instruction initializes the loop counter to 100. The ZERO 
instruction clears the accumulator. The NOP instructions allow for the delay 
slots of the LDH, MPY, and B instructions. 


Executing this dot product code serially requires 16 cycles for each iteration 
plus two cycles to set up the loop counter and initialize the accumulator; 100 it- 
erations require 1602 cycles. 


Example 5-3. Nonparallel Assembly Code for Dot Product

        MVK   .S1   100,A1      ; set up loop counter
        ZERO  .L1   A7          ; zero out accumulator
LOOP:
        LDH   .D1   *A4++,A2    ; load ai from memory
        LDH   .D1   *A3++,A5    ; load bi from memory
        NOP   4                 ; delay slots for LDH
        MPY   .M1   A2,A5,A6    ; ai * bi
        NOP                     ; delay slot for MPY
        ADD   .L1   A6,A7,A7    ; sum += (ai * bi)
        SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to loop
        NOP   5                 ; delay slots for branch
; Branch occurs here


Assigning the same functional unit to both LDH instructions slows perfor- 
mance of this loop. Therefore, reassign the functional units to execute the 
code in parallel, as shown in the dependency graph in Figure 5—2. The parallel 
assembly code is shown in Example 5-4. 

Figure 5-2. Dependency Graph of Dot Product with Parallel Assembly

[Figure: the dependency graph of Figure 5-1 with the functional units
reassigned so that the two LDH instructions use different units.]


Example 5-4. Parallel Assembly Code for Dot Product

        MVK   .S1   100,A1      ; set up loop counter
||      ZERO  .L1   A7          ; zero out accumulator
LOOP:
        LDH   .D1   *A4++,A2    ; load ai from memory
||      LDH   .D2   *B4++,B2    ; load bi from memory
        SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to loop
        NOP   2                 ; delay slots for LDH
        MPY   .M1X  A2,B2,A6    ; ai * bi
        NOP                     ; delay slots for MPY
        ADD   .L1   A6,A7,A7    ; sum += (ai * bi)
; Branch occurs here

Because the loads of ai and bi do not depend on one another, both LDH
instructions can execute in parallel as long as they do not share the same
resources. To schedule the load instructions in parallel, allocate the functional
units as follows:

- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2

Because the MPY instruction now has one source operand from A and one
from B, MPY uses the 1X cross path.



Rearranging the order of the instructions also improves the performance of the 
code. The SUB instruction can take the place of one of the NOP delay slots 
for the LDH instructions. Moving the B instruction after the SUB removes the 
need for the NOP 5 used at the end of the code in Example 5-3. 


The branch now occurs immediately after the ADD instruction so that the MPY 
and ADD execute in parallel with the five delay slots required by the branch 
instruction. 


5.2.6 Comparing Performance 


Executing the dot product code in Example 5—4 requires eight cycles for each 
iteration plus one cycle to set up the loop counter and initialize the accumula- 
tor; 100 iterations require 801 cycles. 


Table 5-1 compares the performance of the nonparallel code with the parallel 
code. 


Table 5-1. Comparison of Nonparallel and Parallel Code

Code Example                                      100 Iterations    Cycle Count
Example 5-3   Dot product nonparallel assembly    2 + 100 x 16      1602
Example 5-4   Dot product parallel assembly       1 + 100 x 8       801

5.3 Using Word Access for Short Data


The parallel code in section 5.2 uses an LDH instruction to read a[i]. Because
a[i] and a[i + 1] are next to each other in memory, you can optimize the code
further by using the load word (LDW) instruction to read a[i] and a[i + 1] at the
same time and load both into a single 32-bit register. (The data must be
word-aligned in memory.)


5.3.1. Unrolled Dot Product C Code 


The C code in Example 5-5 has the effect of unrolling the loop by accumulating
the even elements, a[i] and b[i], into sum0 and the odd elements, a[i + 1]
and b[i + 1], into sum1. After the loop, sum0 and sum1 are added to produce
the final sum. (For another example of loop unrolling, see section 5.8 on
page 5-67.)


Example 5—5. Dot Product C Code (Unrolled) 


int dotp(short a[], short b[])
{
    int sum0, sum1, sum, i;

    sum0 = 0;
    sum1 = 0;
    for (i = 0; i < 100; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    sum = sum0 + sum1;
    return(sum);
}


5.3.2 Translating C Code to Linear Assembly


Example 5-6 shows the list of ’C62xx instructions that execute the unrolled dot 
product loop. Symbolic variable names are used instead of actual registers. 
Using symbolic names for data and pointers makes code easier to write and 
allows the optimizer to allocate registers. However, you must use the .reg as- 
sembly optimizer directive. See the TMS320C6x Optimizing C Compiler 
User’s Guide for more information on writing linear assembly code. 


Example 5-6. Linear Assembly for Dot Product Inner Loop with LDW

        LDW   *a++,ai_i1        ; load ai & a1 from memory
        LDW   *b++,bi_i1        ; load bi & b1 from memory
        MPY   ai_i1,bi_i1,pi    ; ai * bi
        MPYH  ai_i1,bi_i1,pi1   ; ai+1 * bi+1
        ADD   pi,sum0,sum0      ; sum0 += (ai * bi)
        ADD   pi1,sum1,sum1     ; sum1 += (ai+1 * bi+1)
 [cntr] SUB   cntr,1,cntr       ; decrement loop counter
 [cntr] B     LOOP              ; branch to loop


The two load word (LDW) instructions load a[i], a[i+1], b[i], and b[i+1] on each 
iteration. 


Two MPY instructions are now necessary to multiply the second set of array 
elements: 


- The first MPY instruction multiplies the 16 least significant bits (LSBs) in
  each source register: a[i] x b[i].

- The MPYH instruction multiplies the 16 most significant bits (MSBs) of
  each source register: a[i+1] x b[i+1].


The two ADD instructions accumulate the sums of the even and odd elements: 
sum0 and sum1. 


Note: 


This is true only when the ’C62xx is in little-endian mode. In big-endian mode, 
MPY operates on a[i+1] and b[i+1] and MPYH operates on a[i] and b[i]. See 
the TMS320C62xx Peripherals Reference Guide for more information. 
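
The way one word load feeds both multiplies can be modeled in portable C. The sketch below is an illustration only and assumes little-endian packing, as in the note above; pack_le(), mpy_lsb(), mpy_msb(), and dotp_words() are hypothetical helper names, not C6x intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Packs two 16-bit elements the way a little-endian LDW would see them:
 * the element at the lower address (a[i]) lands in the 16 LSBs. */
uint32_t pack_le(int16_t even, int16_t odd)
{
    return (uint32_t)(uint16_t)even | ((uint32_t)(uint16_t)odd << 16);
}

/* Models MPY: signed multiply of the 16 LSBs of each source. */
int32_t mpy_lsb(uint32_t src1, uint32_t src2)
{
    return (int32_t)(int16_t)(src1 & 0xFFFFu) * (int16_t)(src2 & 0xFFFFu);
}

/* Models MPYH: signed multiply of the 16 MSBs of each source. */
int32_t mpy_msb(uint32_t src1, uint32_t src2)
{
    return (int32_t)(int16_t)(src1 >> 16) * (int16_t)(src2 >> 16);
}

/* Unrolled dot product over n elements; n must be even. */
int32_t dotp_words(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum0 = 0, sum1 = 0;
    assert(n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        uint32_t ai_i1 = pack_le(a[i], a[i + 1]);  /* stands in for LDW */
        uint32_t bi_i1 = pack_le(b[i], b[i + 1]);  /* stands in for LDW */
        sum0 += mpy_lsb(ai_i1, bi_i1);             /* even elements */
        sum1 += mpy_msb(ai_i1, bi_i1);             /* odd elements  */
    }
    return sum0 + sum1;
}
```

In big-endian mode the roles of the two halves would swap, so a model of that mode would place a[i] in the 16 MSBs instead.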

5.3.3 Drawing a Dependency Graph


The dependency graph in Figure 5—3 shows that the LDW instructions are par- 
ents of the MPY instructions and the MPY instructions are parents of the ADD 
instructions. To split the graph between the A and B register files, place an 
equal number of LDWs, MPYs, and ADDs on each side. To keep both sides 
even, place the remaining two instructions, B and SUB, on opposite sides. 


Figure 5-3. Dependency Graph of Dot Product With LDW

[Figure: the dependency graph split into an A side and a B side, with one
LDW on each side.]


5.3.4 Allocating Resources 


After splitting the dependency graph, you can assign functional units and reg- 
isters, as shown in the dependency graph in Figure 5—4 and in the instructions 
in Example 5—7. The .M1X and .M2X represent a path in the dependency 
graph crossing from one side to the other. 

Figure 5-4. Dependency Graph of Dot Product With LDW (Showing Functional Units)

[Figure: the A-side and B-side graph of Figure 5-3 with functional units
assigned; for example, the B-side LDW (bi & bi+1, 5 cycles) uses .D2 and
the B-side ADD uses .L2.]


Example 5-7. Linear Assembly for Dot Product Inner Loop With LDW
(With Allocated Resources)

        LDW   .D1   *A4++,A2    ; load ai and ai+1 from memory
        LDW   .D2   *B4++,B2    ; load bi and bi+1 from memory
        MPY   .M1X  A2,B2,A6    ; ai * bi
        MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
        ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
        SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to loop

5.3.5 Final Assembly


Example 5-8 shows the final assembly code for the unrolled loop, using LDW 
instructions instead of LDH instructions. 


Example 5-8. Assembly Code for Dot Product With LDW (Before Software Pipelining)

        MVK   .S1   50,A1       ; set up loop counter
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator
LOOP:
        LDW   .D1   *A4++,A2    ; load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ; load bi & bi+1 from memory
        SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to loop
        NOP   2
        MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
        NOP
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
; Branch occurs here
        ADD   .L1X  A7,B7,A4    ; sum = sum0 + sum1


The code in Example 5-8 includes the following optimizations: 


- The setup code for the loop is included to initialize the array pointers and
  the loop counter and to clear the accumulators. The setup code assumes
  that A4 and B4 have been initialized to point to arrays a and b, respectively.

- The MVK instruction initializes the loop counter.

- The two ZERO instructions, which execute in parallel, initialize the even
  and odd accumulators (sum0 and sum1) to 0.

- The third ADD instruction adds the even and odd accumulators.

5.3.6 Comparing Performance


Executing the dot product with the optimizations in Example 5—8 requires only 
50 iterations, because you operate in parallel on both the even and odd array 
elements. With the setup code and the final ADD instruction, 100 iterations of 
this loop require a total of 402 cycles (1+ 8 x 50 + 1). 


Table 5-2 compares the performance of the different versions of the dot 
product code discussed so far. 


Table 5-2. Comparison of Dot Product Code With Use of LDW

Code Example                                            100 Iterations      Cycle Count
Example 5-3   Dot product nonparallel assembly          2 + 100 x 16        1602
Example 5-4   Dot product parallel assembly             1 + 100 x 8         801
Example 5-8   Dot product parallel assembly with LDW    1 + (50 x 8) + 1    402


5.4 Software Pipelining


This section describes the process for improving the performance of the as- 
sembly code in the previous section through software pipelining. 


Software pipelining is a technique used to schedule instructions from a loop 
so that multiple iterations execute in parallel. The parallel resources on the 
’C62xx make it possible to initiate a new loop iteration before previous itera- 
tions finish. The goal of software pipelining is to start a new loop iteration as 
soon as possible. 


The modulo iteration interval scheduling table is introduced in this section as 
an aid to creating software-pipelined loops. 


The dot product code in Example 5—8 needs eight cycles for each iteration of 
the loop: five cycles for the LDWs, two cycles for the MPYs, and one cycle for 
the ADDs. 


Figure 5-5 shows the dependency graph for the dot product instructions. 
Example 5—9 shows the same dot product assembly code in Example 5—7 on 
page 5-13, except that the SUB instruction is now conditional on the loop 
counter (A1). 


Note:

Making the SUB instruction conditional on A1 ensures that A1 stops
decrementing when it reaches 0. Otherwise, as the loop executes five more times,
the loop counter becomes a negative number. When A1 is negative, it is
nonzero and, therefore, causes the condition on the branch to be true again. If the
SUB instruction were not conditional on A1, you would have an infinite loop.
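
The effect described in the note can be reproduced with a small C model. This is a deliberately simplified sketch, not a cycle-accurate simulation: count_taken_branches() is a hypothetical name, each pass of the loop stands for one cycle of a single-cycle loop, and max_cycles exists only so that the runaway case terminates.

```c
#include <assert.h>

/* Counts how many cycles issue a taken branch. Each cycle the branch
 * tests A1 (taken when A1 is nonzero) and the SUB then decrements A1.
 * conditional_sub selects between "[A1] SUB" (nonzero) and a plain
 * "SUB" (zero). */
int count_taken_branches(int a1, int conditional_sub, int max_cycles)
{
    int taken = 0;
    assert(a1 >= 0 && max_cycles >= 0);
    for (int cycle = 0; cycle < max_cycles; cycle++) {
        if (a1 != 0)
            taken++;                      /* [A1] B LOOP  */
        if (!conditional_sub || a1 != 0)
            a1--;                         /* SUB A1,1,A1  */
    }
    return taken;
}
```

With a starting count of 50 and a 1000-cycle bound, the conditional SUB yields 50 taken branches and then stops. The unconditional SUB skips a single cycle when A1 passes through 0 and then keeps branching on every remaining cycle because A1 is negative and therefore nonzero.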




Figure 5-5. Dependency Graph of Dot Product With LDW (Showing Functional Units) 

Example 5-9. Linear Assembly for Dot Product Inner Loop
(With Conditional SUB Instruction)

        LDW   .D1   *A4++,A2    ; load ai and ai+1 from memory
        LDW   .D2   *B4++,B2    ; load bi and bi+1 from memory
        MPY   .M1X  A2,B2,A6    ; ai * bi
        MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
        ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
 [A1]   SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to top of loop

5.4.1 Modulo Iteration Interval Scheduling


Another way to represent the performance of the code is by looking at it in a
modulo iteration interval scheduling table. This table shows how a
software-pipelined loop executes and tracks the available resources on a
cycle-by-cycle basis to ensure that no resource is used twice on any given
cycle. The iteration interval of a loop is the number of cycles between the
initiations of successive iterations of that loop.


The code in Example 5-8 needs eight cycles for each iteration of the loop, so 
the iteration interval is eight. 


Table 5-3 shows a modulo iteration interval scheduling table for the dot
product loop before software pipelining (Example 5-8). Each row represents a
functional unit. There is a column for each cycle in the loop showing the
instruction that is executing on a particular cycle:

- LDWs on the .D units are issued on cycles 0, 8, 16, 24, etc.
- MPY and MPYH on the .M units are issued on cycles 5, 13, 21, 29, etc.
- ADDs on the .L units are issued on cycles 7, 15, 23, 31, etc.
- SUB on the .S1 unit is issued on cycles 1, 9, 17, 25, etc.
- B on the .S2 unit is issued on cycles 2, 10, 18, 26, etc.


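Table 5-3. Modulo Iteration Interval Scheduling Table for Dot Product
(Before Software Pipelining)

Unit/Cycle  0, 8, ...   1, 9, ...   2, 10, ...  3, 11, ...  4, 12, ...  5, 13, ...  6, 14, ...  7, 15, ...
.D1         LDW
.D2         LDW
.M1                                                                     MPY
.M2                                                                     MPYH
.L1                                                                                             ADD
.L2                                                                                             ADD
.S1                     SUB
.S2                                 B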


In this example, each unit is used only once every eight cycles. 




5.4.1.1. Determining the Minimum Iteration Interval 


Software pipelining increases performance by using the resources more effi- 
ciently. However, to create a fully pipelined schedule, it is helpful to first deter- 
mine the minimum iteration interval. 


The minimum iteration interval of a loop is the minimum number of cycles you 
must wait between each initiation of successive iterations of that loop. The 
smaller the iteration interval, the fewer cycles it takes to execute a loop. 


Resources and data dependency constraints determine the minimum iteration 
interval. The most-used resource constrains the minimum iteration interval. 
For example, if four instructions in a loop all use the .S1 unit, the minimum it- 
eration interval is at least 4. Four instructions using the same resource cannot 
execute in parallel and, therefore, require at least four separate cycles to 
execute each instruction. 


With the SUB and branch instructions on opposite sides of the dependency 
graph in Figure 5-5, all eight instructions use a different functional unit and no 
two instructions use the same cross paths (1X and 2X). Because no two 
instructions use the same resource, the minimum iteration interval based on 
resources is 1. 
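
The resource side of this rule can be written as a short C function. It is a sketch under simple assumptions: resource_min_ii() is a hypothetical name, each entry of uses[] counts the instructions in one iteration that are assigned to one specific resource (a functional unit or a cross path), and data dependency constraints are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Resource-constrained lower bound on the iteration interval.
 * uses[u] is the number of instructions in one loop iteration that
 * are assigned to resource u. */
int resource_min_ii(const int *uses, size_t num_resources)
{
    int min_ii = 1;   /* at most one new iteration can start per cycle */
    for (size_t u = 0; u < num_resources; u++) {
        assert(uses[u] >= 0);
        if (uses[u] > min_ii)
            min_ii = uses[u];   /* the most-used resource sets the bound */
    }
    return min_ii;
}
```

For the dot product loop, the eight functional units and the two cross paths are each used once, so the bound is 1. For the case of four instructions that all use .S1, the bound is 4.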


Note:

In this particular example, there are no data dependencies to affect the
minimum iteration interval. However, future examples may demonstrate this
constraint.

5.4.1.2 Creating a Fully Pipelined Schedule


Having determined that the minimum iteration interval is 1, you can initiate a 
new iteration every cycle. You can schedule LDW and MPY instructions on 
every cycle. Table 5—4 shows a fully pipelined schedule for the dot product ex- 
ample. 


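Table 5-4. Modulo Iteration Interval Table for Dot Product (After Software Pipelining)

             Loop Prolog (cycles 0-6)
Unit/Cycle   0      1      2      3        4         5          6           7, 8, 9...
.D1          LDW    *LDW   **LDW  ***LDW   ****LDW   *****LDW   ******LDW   *******LDW
.D2          LDW    *LDW   **LDW  ***LDW   ****LDW   *****LDW   ******LDW   *******LDW
.M1                                                  MPY        *MPY        **MPY
.M2                                                  MPYH       *MPYH       **MPYH
.L1                                                                         ADD
.L2                                                                         ADD
.S1                 SUB    *SUB   **SUB    ***SUB    ****SUB    *****SUB    ******SUB
.S2                        B      *B       **B       ***B       ****B       *****B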


Note: The asterisks indicate the iteration of the loop; shading indicates the single-cycle loop. 


The rightmost column is a single-cycle loop that contains the entire loop. 
Cycles 0-6 are loop setup code, or loop prolog. 


Asterisks define which iteration of the loop the instruction is executing each 
cycle. For example, the rightmost column shows that on any given cycle inside 
the loop: 


- The ADD instructions are adding data for iteration n.
- The MPY instructions are multiplying data for iteration n + 2 (**).
- The LDW instructions are loading data for iteration n + 7 (*******).
- The SUB instruction is executing for iteration n + 6 (******).
- The branch instruction is executing for iteration n + 5 (*****).

In this case, multiple iterations of the loop execute in parallel in a software
pipeline that is eight iterations deep, with iterations n through n + 7 executing
in parallel. Software pipelines are rarely deeper than the one created by this
single-cycle loop. As loop sizes grow, the number of iterations that can execute
in parallel tends to become fewer.
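
The relationship between cycle, iteration, and asterisk count can be checked with a few lines of C. This is an illustrative sketch: iteration_on_cycle() and pipeline_depth() are hypothetical names, and the offsets are the issue cycles of each instruction within its own iteration, as listed for the schedule above.

```c
#include <assert.h>

/* Issue cycle of each instruction relative to the start of its own
 * iteration. */
enum { LDW_OFFSET = 0, SUB_OFFSET = 1, B_OFFSET = 2,
       MPY_OFFSET = 5, ADD_OFFSET = 7 };

/* With an iteration interval of ii, iteration k starts on cycle k * ii.
 * Returns the iteration that an instruction with the given offset works
 * on during `cycle`, or -1 if the instruction does not issue on that
 * cycle (still in the prolog, or not one of its issue slots). */
int iteration_on_cycle(int cycle, int offset, int ii)
{
    assert(ii > 0 && cycle >= 0 && offset >= 0);
    if (cycle < offset || (cycle - offset) % ii != 0)
        return -1;
    return (cycle - offset) / ii;
}

/* Number of iterations in flight at once: ceil(iteration_length / ii). */
int pipeline_depth(int iteration_length, int ii)
{
    assert(ii > 0 && iteration_length > 0);
    return (iteration_length + ii - 1) / ii;
}
```

On cycle 7 with an iteration interval of 1, the ADD works on iteration 0 while the MPY, branch, SUB, and LDW work on iterations 2, 5, 6, and 7, matching the asterisk counts. An eight-cycle iteration with an interval of 1 gives a depth of eight.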


5.4.2 Using the Assembly Optimizer to Create Optimized Loops 


Example 5—10 shows the linear assembly code for the full dot product loop. 
You can use this code as input to the assembly optimizer tool to create soft- 
ware pipelined loops automatically. See the TMS320C6x Optimizing C Com- 
piler User’s Guide for more information on the assembly optimizer. 


Example 5-10. Linear Assembly for Full Dot Product

        .global _dotp
_dotp:  .cproc  a, b

        .reg    sum, sum0, sum1, cntr
        .reg    ai_i1, bi_i1, pi, pi1

        MVK     50,cntr             ; cntr = 100/2
        ZERO    sum0                ; multiply result = 0
        ZERO    sum1                ; multiply result = 0

LOOP:   .trip   50
        LDW     *a++,ai_i1          ; load ai & a1 from memory
        LDW     *b++,bi_i1          ; load bi & b1 from memory
        MPY     ai_i1,bi_i1,pi      ; ai * bi
        MPYH    ai_i1,bi_i1,pi1     ; ai+1 * bi+1
        ADD     pi,sum0,sum0        ; sum0 += (ai * bi)
        ADD     pi1,sum1,sum1       ; sum1 += (ai+1 * bi+1)
 [cntr] SUB     cntr,1,cntr         ; decrement loop counter
 [cntr] B       LOOP                ; branch to loop

        ADD     sum0,sum1,sum       ; compute final result

        .return sum
        .endproc


Resources such as functional units and 1X and 2X cross paths do not have 
to be specified because these can be allocated automatically by the assembly 
optimizer. 

5.4.3 Final Assembly


Example 5-11 shows the assembly code for the software pipelined dot
product in Table 5-4. The accumulators are initialized to 0 and the loop counter is
set up in the first execute packet in parallel with the first LDW instructions. The
asterisks in the comments correspond with those in Table 5-4.


Note:

All instructions executing in parallel constitute an execute packet. An
execute packet can contain up to eight instructions.

See the TMS320C62xx CPU and Instruction Set Reference Guide for more
information about pipeline operation.


Multiple branch instructions are in the pipe. The first branch is issued on 
cycle 2 but does not actually branch until the end of cycle 7 (after five delay 
slots). The branch target is the execute packet defined by the label LOOP. On 
cycle 7, the first branch returns to the same execute packet, resulting in a 
single-cycle loop. On every cycle after cycle 7, a branch executes back to 
LOOP until the loop counter finally decrements to 0. Once the loop counter is 
0, five more branches execute because they are already in the pipe. 


Executing the dot product code with the software pipelining as shown in 
Example 5—11 requires a total of 58 cycles (7 + 50 + 1), which is a significant 
improvement over the 402 cycles required by the code in Example 5-8. 


Note:

The code created by the assembly optimizer will not completely match the
final assembly code shown in this and future sections because different
versions of the tool will produce slightly different code. However, the inner loop
performance (number of cycles per iteration) should be similar.

Example 5-11. Assembly Code for Dot Product (Software Pipelined)

        LDW   .D1   *A4++,A2    ; load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ; load bi & bi+1 from memory
||      MVK   .S1   50,A1       ; set up loop counter
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator

 [A1]   SUB   .S1   A1,1,A1     ; decrement loop counter
||      LDW   .D1   *A4++,A2    ;* load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;* load bi & bi+1 from memory

 [A1]   SUB   .S1   A1,1,A1     ;* decrement loop counter
|| [A1] B     .S2   LOOP        ; branch to loop
||      LDW   .D1   *A4++,A2    ;** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;** load bi & bi+1 from memory

 [A1]   SUB   .S1   A1,1,A1     ;** decrement loop counter
|| [A1] B     .S2   LOOP        ;* branch to loop
||      LDW   .D1   *A4++,A2    ;*** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;*** load bi & bi+1 from memory

 [A1]   SUB   .S1   A1,1,A1     ;*** decrement loop counter
|| [A1] B     .S2   LOOP        ;** branch to loop
||      LDW   .D1   *A4++,A2    ;**** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;**** load bi & bi+1 from memory

        MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;**** decrement loop counter
|| [A1] B     .S2   LOOP        ;*** branch to loop
||      LDW   .D1   *A4++,A2    ;***** ld ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;***** ld bi & bi+1 from memory

        MPY   .M1X  A2,B2,A6    ;* ai * bi
||      MPYH  .M2X  A2,B2,B6    ;* ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;***** decrement loop counter
|| [A1] B     .S2   LOOP        ;**** branch to loop
||      LDW   .D1   *A4++,A2    ;****** ld ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;****** ld bi & bi+1 from memory

LOOP:
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;****** decrement loop counter
|| [A1] B     .S2   LOOP        ;***** branch to loop
||      LDW   .D1   *A4++,A2    ;******* ld ai & ai+1 fm memory
||      LDW   .D2   *B4++,B2    ;******* ld bi & bi+1 fm memory
; Branch occurs here
        ADD   .L1X  A7,B7,A4    ; sum = sum0 + sum1

5.4.3.1 Removing Extraneous Instructions


The code in Example 5-11 executes extra iterations of some of the
instructions in the loop. The following operations occur in parallel on the last cycle of
the loop:


- Iteration 50 of the ADD instructions
- Iteration 52 of the MPY and MPYH instructions
- Iteration 57 of the LDW instructions


In most cases, extra iterations are not a problem; however, when extraneous 
LDWs access unmapped memory, you can get unpredictable results. If the ex- 
traneous instructions present a potential problem, remove the extraneous 
LDW and MPY instructions by adding an epilog like that included in the second 
part of Example 5-12 on page 5-26. 


To eliminate LDWs from the iterations 51 through 57, run the loop seven fewer 
times. This brings the loop counter to 43 (50 — 7), which means you still must 
execute seven more cycles of ADD instructions and five more cycles of MPY 
instructions. Five pairs of MPYs and seven pairs of ADDs are now outside the 
loop. The LDWs, MPYs, and ADDs all execute exactly 50 times. (The shaded 
areas of Example 5—12 indicate the changes in this code.) 


Executing the dot product code in Example 5—12 with no extraneous LDWs 
still requires a total of 58 cycles (7 + 43 + 7 + 1), but the code size is now larg- 
er. 
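
The execution counts quoted above follow from where each instruction sits in its iteration. The C sketch below is an illustration with assumed names: executions() is hypothetical, `offset` is the issue cycle of the instruction within its own iteration (0 for LDW, 5 for MPY, 7 for ADD), `depth` is the offset of the last instruction (7), and a single-cycle loop is assumed.

```c
#include <assert.h>

/* Number of times one instruction executes across the prolog, the loop,
 * and an optional epilog that drains the pipeline. */
int executions(int offset, int depth, int loop_trips, int has_epilog)
{
    assert(offset >= 0 && offset <= depth && loop_trips >= 0);
    int in_prolog = depth - offset;           /* prolog cycles offset..depth-1 */
    int in_epilog = has_epilog ? offset : 0;  /* drained after the loads stop  */
    return in_prolog + loop_trips + in_epilog;
}
```

Without an epilog and with 50 loop trips, the LDW, MPY, and ADD instructions execute 57, 52, and 50 times. With 43 loop trips and an epilog, all three execute exactly 50 times.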

Example 5-12. Assembly Code for Dot Product (Software Pipelined
With No Extraneous Loads)

        LDW   .D1   *A4++,A2    ; load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ; load bi & bi+1 from memory
||      MVK   .S1   43,A1       ; set up loop counter
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator

 [A1]   SUB   .S1   A1,1,A1     ; decrement loop counter
||      LDW   .D1   *A4++,A2    ;* load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;* load bi & bi+1 from memory

 [A1]   SUB   .S1   A1,1,A1     ;* decrement loop counter
|| [A1] B     .S2   LOOP        ; branch to loop
||      LDW   .D1   *A4++,A2    ;** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;** load bi & bi+1 from memory

 [A1]   SUB   .S1   A1,1,A1     ;** decrement loop counter
|| [A1] B     .S2   LOOP        ;* branch to loop
||      LDW   .D1   *A4++,A2    ;*** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;*** load bi & bi+1 from memory

 [A1]   SUB   .S1   A1,1,A1     ;*** decrement loop counter
|| [A1] B     .S2   LOOP        ;** branch to loop
||      LDW   .D1   *A4++,A2    ;**** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;**** load bi & bi+1 from memory

        MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;**** decrement loop counter
|| [A1] B     .S2   LOOP        ;*** branch to loop
||      LDW   .D1   *A4++,A2    ;***** ld ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;***** ld bi & bi+1 from memory

        MPY   .M1X  A2,B2,A6    ;* ai * bi
||      MPYH  .M2X  A2,B2,B6    ;* ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;***** decrement loop counter
|| [A1] B     .S2   LOOP        ;**** branch to loop
||      LDW   .D1   *A4++,A2    ;****** ld ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;****** ld bi & bi+1 from memory

LOOP:
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;****** decrement loop counter
|| [A1] B     .S2   LOOP        ;***** branch to loop
||      LDW   .D1   *A4++,A2    ;******* ld ai & ai+1 fm memory
||      LDW   .D2   *B4++,B2    ;******* ld bi & bi+1 fm memory
; Branch occurs here

; Epilog: five remaining MPY/MPYH pairs and seven remaining ADD pairs
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)

        ADD   .L1X  A7,B7,A4    ; sum = sum0 + sum1

5.4.3.2 Priming the Loop


Although Example 5-12 executes as fast as possible, the code size can be 
smaller without significantly sacrificing performance. To help reduce code 
size, you can use a technique called priming the loop. Assuming that you can 
handle extraneous LDWs, start with Example 5—11, which has no epilog and, 
therefore, fewer instructions. (This technique can be used equally well with 
Example 5-12.) 


To eliminate the prolog and, therefore, the extra LDW and MPY instructions, 
begin execution at the loop body (at the LOOP label). Eliminating the prolog 
means that: 


- Two LDWs, two MPYs, and two ADDs occur in the first execution cycle of
  the loop.

- Because the first LDWs require five cycles to write results into a register,
  the MPYs do not multiply valid data until after the loop executes five times.
  The ADDs have no valid data until after seven cycles (five cycles for the
  first LDWs and two more cycles for the first valid MPYs).


Example 5—13 shows the loop without the prolog but with four new instructions 
that zero the inputs to the MPY and ADD instructions. Making the MPYs and 
ADDs use Os before valid data is available ensures that the final accumulator 
values are unaffected. (The loop counter is initialized to 57 to accommodate 
the seven extra cycles needed to prime the loop.) 


Because the first LDWs are not issued until after seven cycles, the code in 
Example 5—13 requires a total of 65 cycles (7 + 57+ 1). Therefore, you are re- 
ducing the code size with a slight loss in performance. 
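
Why zeroed inputs leave the result unchanged can be seen in a small C model of the primed loop. This is a sketch under stated assumptions, not generated code: primed_dotp() is a hypothetical name, the two delay lines stand in for the five-cycle LDW latency and the two-cycle MPY latency, and the arrays are plain C shorts rather than packed words.

```c
#include <assert.h>
#include <stddef.h>

#define LOAD_LATENCY 5   /* cycles from LDW issue to a usable result */
#define MPY_LATENCY  2   /* cycles from MPY issue to a usable result */

/* Models the primed single-cycle loop: every trip performs ADD, MPY, and
 * LDW, with results moving through delay lines that start at zero (the
 * role of the ZERO instructions ahead of the loop). */
int primed_dotp(const short *a, const short *b, size_t npairs)
{
    short a_lo[LOAD_LATENCY] = {0}, a_hi[LOAD_LATENCY] = {0};
    short b_lo[LOAD_LATENCY] = {0}, b_hi[LOAD_LATENCY] = {0};
    int p0[MPY_LATENCY] = {0}, p1[MPY_LATENCY] = {0};
    int sum0 = 0, sum1 = 0;
    size_t trips = npairs + LOAD_LATENCY + MPY_LATENCY;  /* 50 + 7 = 57 */

    assert(a != NULL && b != NULL);
    for (size_t c = 0; c < trips; c++) {
        size_t ls = c % LOAD_LATENCY;   /* slot written 5 trips ago */
        size_t ms = c % MPY_LATENCY;    /* slot written 2 trips ago */

        sum0 += p0[ms];                 /* ADD  .L1  */
        sum1 += p1[ms];                 /* ADD  .L2  */
        p0[ms] = a_lo[ls] * b_lo[ls];   /* MPY  .M1X */
        p1[ms] = a_hi[ls] * b_hi[ls];   /* MPYH .M2X */
        if (c < npairs) {               /* LDW  .D1 and .D2 */
            a_lo[ls] = a[2 * c];
            a_hi[ls] = a[2 * c + 1];
            b_lo[ls] = b[2 * c];
            b_hi[ls] = b[2 * c + 1];
        }
        /* For c >= npairs the real loop still issues LDWs (the extraneous
         * loads). Their values never reach the ADDs, so the model skips
         * them to stay inside the arrays. */
    }
    return sum0 + sum1;
}
```

For 50 pairs the model runs 57 trips. The first seven ADD pairs add zeros, and the result equals the straightforward dot product of the 100 elements.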

Example 5-13. Assembly Code for Dot Product (Software Pipelined - No Prolog
or Epilog)

        MVK   .S1   57,A1       ; set up loop counter

 [A1]   SUB   .S1   A1,1,A1     ; decrement loop counter
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator

 [A1]   SUB   .S1   A1,1,A1     ;* decrement loop counter
|| [A1] B     .S2   LOOP        ; branch to loop
||      ZERO  .L1   A6          ; zero out add input
||      ZERO  .L2   B6          ; zero out add input

 [A1]   SUB   .S1   A1,1,A1     ;** decrement loop counter
|| [A1] B     .S2   LOOP        ;* branch to loop
||      ZERO  .L1   A2          ; zero out mpy input
||      ZERO  .L2   B2          ; zero out mpy input

 [A1]   SUB   .S1   A1,1,A1     ;*** decrement loop counter
|| [A1] B     .S2   LOOP        ;** branch to loop

 [A1]   SUB   .S1   A1,1,A1     ;**** decrement loop counter
|| [A1] B     .S2   LOOP        ;*** branch to loop

 [A1]   SUB   .S1   A1,1,A1     ;***** decrement loop counter
|| [A1] B     .S2   LOOP        ;**** branch to loop

LOOP:
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;****** decrement loop counter
|| [A1] B     .S2   LOOP        ;***** branch to loop
||      LDW   .D1   *A4++,A2    ;******* ld ai & ai+1 fm memory
||      LDW   .D2   *B4++,B2    ;******* ld bi & bi+1 fm memory
; Branch occurs here
        ADD   .L1X  A7,B7,A4    ; sum = sum0 + sum1




5.4.3.3. Removing Extra SUB Instructions 


To reduce code size further, you can remove extra SUB instructions. If you 
know that the loop count is at least 6, you can eliminate the extra SUB instruc- 
tions as shown in Example 5-14. The first five branch instructions are made 
unconditional, because they always execute. (If you do not know that the loop 
count is at least 6, you must keep the SUB instructions that decrement before 
each conditional branch as in Example 5-13.) Based on the elimination of six 
SUB instructions, the loop counter is now 51 (57 — 6). This code shows some 
improvement over Example 5-13. The loop in Example 5-14 requires 63 
cycles (5 + 57 + 1). 


Example 5-14. Assembly Code for Dot Product (Software Pipelined
With Smallest Code Size)

        B     .S2   LOOP        ; branch to loop
||      MVK   .S1   51,A1       ; set up loop counter

        B     .S2   LOOP        ;* branch to loop

        B     .S2   LOOP        ;** branch to loop
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator

        B     .S2   LOOP        ;*** branch to loop
||      ZERO  .L1   A6          ; zero out add input
||      ZERO  .L2   B6          ; zero out add input

        B     .S2   LOOP        ;**** branch to loop
||      ZERO  .L1   A2          ; zero out mpy input
||      ZERO  .L2   B2          ; zero out mpy input

LOOP:
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;****** decrement loop counter
|| [A1] B     .S2   LOOP        ;***** branch to loop
||      LDW   .D1   *A4++,A2    ;******* ld ai & ai+1 fm memory
||      LDW   .D2   *B4++,B2    ;******* ld bi & bi+1 fm memory
; Branch occurs here
        ADD   .L1X  A7,B7,A4    ; sum = sum0 + sum1

5.4.4 Comparing Performance


Table 5-5 compares the performance of all versions of the dot product code. 


Table 5-5. Comparison of Dot Product Code Examples

Code Example                                                              100 Iterations      Cycle Count
Example 5-3    Dot product nonparallel assembly                           2 + 100 x 16        1602
Example 5-4    Dot product parallel assembly                              1 + 100 x 8         801
Example 5-8    Dot product parallel assembly with LDW                     1 + (50 x 8) + 1    402
Example 5-11   Software-pipelined dot product                             7 + 50 + 1          58
Example 5-12   Software-pipelined dot product with no extraneous loads    7 + 43 + 7 + 1      58
Example 5-13   Software-pipelined dot product with no prolog or epilog    7 + 57 + 1          65
Example 5-14   Software-pipelined dot product with smallest code size     5 + 57 + 1          63
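
The cycle counts in the table all follow one pattern: setup or prolog cycles, plus loop trips times cycles per trip, plus any cycles after the loop. The helper below is a hypothetical sketch of that arithmetic (total_cycles() is not part of any TI tool).

```c
#include <assert.h>

/* Total cycles = setup (or prolog) + trips * cycles per trip + tail,
 * where tail covers any epilog cycles and the final ADD. */
int total_cycles(int setup, int trips, int cycles_per_trip, int tail)
{
    assert(setup >= 0 && trips >= 0 && cycles_per_trip > 0 && tail >= 0);
    return setup + trips * cycles_per_trip + tail;
}
```

For example, total_cycles(1, 50, 8, 1) gives the 402 cycles of Example 5-8, and total_cycles(7, 43, 1, 8) gives the 58 cycles of Example 5-12, where the tail of 8 is the seven epilog cycles plus the final ADD.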

5.5 Modulo Scheduling of Multicycle Loops


Section 5.4 demonstrated the modulo-scheduling technique for the dot 
product code. In that example of a single-cycle loop, none of the instructions 
used the same resources. Multicycle loops can present resource conflicts 
which affect modulo scheduling. This section describes techniques to deal 
with this issue. 


5.5.1. Weighted Vector Sum C Code 


Example 5-15 shows the C code for a weighted vector sum. 


Example 5-15. Weighted Vector Sum C Code 


void w_vec(short a[],short b[],short c[],short m)
{
    int i;

    for (i=0; i<100; i++) {
        c[i] = ((m * a[i]) >> 15) + b[i];
    }
}


5.5.2 Translating C Code to Linear Assembly 


Example 5-16 shows the linear assembly that executes the weighted vector 
sum in Example 5—15. This linear assembly does not have functional units as- 
signed. The dependency graph will help in those decisions. However, before 
looking at the dependency graph, the code can be optimized further. 


Example 5-16. Linear Assembly for Weighted Vector Sum Inner Loop 


        LDH   *aptr++,ai           ; ai
        LDH   *bptr++,bi           ; bi
        MPY   m,ai,pi              ; m * ai
        SHR   pi,15,pi_scaled      ; (m * ai) >> 15
        ADD   pi_scaled,bi,ci      ; ci = (m * ai) >> 15 + bi
        STH   ci,*cptr++           ; store ci
[cntr]  SUB   cntr,1,cntr          ; decrement loop counter
[cntr]  B     LOOP                 ; branch to loop
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5.5.3 Determining the Minimum Iteration Interval 


Example 5—16 includes three memory operations in the inner loop (two LDHs 
and the STH) that must each use a .D unit. Only two .D units are available on 
any single cycle; therefore, this loop requires at least two cycles. Because no 
other resource is used more than twice, the minimum iteration interval for this 
loop is 2. 


Memory operations determine the minimum iteration interval in this example. 
Therefore, before scheduling this assembly code, unroll the loop and perform 
LDWs to help improve the performance. 


5.5.3.1 Unrolling the Weighted Vector Sum C Code 


Example 5—17 shows the C code for an unrolled version of the weighted vector 
sum. 


Example 5-17. Weighted Vector Sum C Code (Unrolled) 


void w_vec(short a[],short b[],short c[],short m)
{
    int i;

    for (i=0; i<100; i+=2) {
        c[i] = ((m * a[i]) >> 15) + b[i];
        c[i+1] = ((m * a[i+1]) >> 15) + b[i+1];
    }
}
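
The unrolled loop should compute exactly the same results as the original loop. The following is a minimal sketch, not part of the original examples, that places both forms side by side so you can check this on a host compiler. The function names, the test pattern, and the `const` qualifiers are assumptions made for illustration.

```c
#include <assert.h>

#define W_VEC_N 100   /* element count used by the weighted vector sum */

/* Original form: one element per loop iteration. */
void w_vec_rolled(const short a[], const short b[], short c[], short m)
{
    int i;
    for (i = 0; i < W_VEC_N; i++) {
        c[i] = (short)(((m * a[i]) >> 15) + b[i]);
    }
}

/* Unrolled form: two elements per loop iteration. */
void w_vec_unrolled(const short a[], const short b[], short c[], short m)
{
    int i;
    assert(W_VEC_N % 2 == 0);   /* unrolling by 2 needs an even count */
    for (i = 0; i < W_VEC_N; i += 2) {
        c[i]     = (short)(((m * a[i])     >> 15) + b[i]);
        c[i + 1] = (short)(((m * a[i + 1]) >> 15) + b[i + 1]);
    }
}

/* Hypothetical helper: returns 1 when both forms agree on a test pattern. */
int w_vec_forms_agree(void)
{
    short a[W_VEC_N], b[W_VEC_N], c1[W_VEC_N], c2[W_VEC_N];
    int i;

    for (i = 0; i < W_VEC_N; i++) {
        a[i] = (short)(i * 37 - 1500);
        b[i] = (short)(1000 - i * 11);
    }
    w_vec_rolled(a, b, c1, 16384);
    w_vec_unrolled(a, b, c2, 16384);
    for (i = 0; i < W_VEC_N; i++) {
        if (c1[i] != c2[i]) {
            return 0;
        }
    }
    return 1;
}
```

Because each pass of the unrolled loop handles c[i] and c[i+1], the two 16-bit inputs for a pass sit in one 32-bit word, which is what allows a single LDW to replace two LDHs.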
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5.5.3.2. Translating Unrolled Inner Loop to Linear Assembly 


Example 5-18 shows the linear assembly that calculates c[i] and c[i+1] for the 
weighted vector sum in Example 5-17. 

- The two store pointers (*ciptr and *ci+1ptr) are separated so that one 
  (*ciptr) increments by 2 through the even-indexed elements of the array 
  (c[0], c[2], ...) and the other (*ci+1ptr) increments by 2 through the 
  odd-indexed elements (c[1], c[3], ...). 

- AND and SHR separate bi and bi+1 into two separate registers. 

- This code assumes that mask is preloaded with 0x0000FFFF to clear the 
  upper 16 bits. The shift right of 16 places bi+1 into the 16 LSBs. 
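
The AND and SHR steps can be pictured in C. The following is a minimal sketch under two assumptions: the first halfword of a pair occupies the 16 LSBs of the loaded word (little-endian ordering), and the helper names are hypothetical. The 'C62xx SHR is an arithmetic shift; the sketch rebuilds the sign explicitly so that it does not rely on implementation-defined C behavior.

```c
#include <assert.h>
#include <stdint.h>

/* bi: the AND with a mask of 0x0000FFFF keeps only the 16 LSBs. */
uint32_t unpack_low(uint32_t word)
{
    const uint32_t mask = 0x0000FFFFu;
    return word & mask;
}

/* bi+1: shifting right by 16 moves the upper halfword into the 16 LSBs. */
int32_t unpack_high(uint32_t word)
{
    uint32_t hi = word >> 16;
    assert(hi <= 0xFFFFu);
    /* Sign-extend from bit 15, as an arithmetic shift would. */
    return (hi & 0x8000u) ? (int32_t)hi - 65536 : (int32_t)hi;
}

/* Hypothetical helper: packs two halfwords the way one LDW would see
 * them, assuming the first halfword lands in the 16 LSBs. */
uint32_t pack_pair(int16_t first, int16_t second)
{
    return (uint32_t)(uint16_t)first | ((uint32_t)(uint16_t)second << 16);
}
```

The masked bi value is not sign-extended, but only its 16 LSBs reach memory through STH, so the stored result is unaffected.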


Example 5-18. Linear Assembly for Weighted Vector Sum Using LDW 


        LDW    *aptr++,ai_i+1           ; ai & ai+1
        LDW    *bptr++,bi_i+1           ; bi & bi+1
        MPY    m,ai_i+1,pi              ; m * ai
        MPYHL  m,ai_i+1,pi+1            ; m * ai+1
        SHR    pi,15,pi_scaled          ; (m * ai) >> 15
        SHR    pi+1,15,pi+1_scaled      ; (m * ai+1) >> 15
        AND    bi_i+1,mask,bi           ; bi
        SHR    bi_i+1,16,bi+1           ; bi+1
        ADD    pi_scaled,bi,ci          ; ci = (m * ai) >> 15 + bi
        ADD    pi+1_scaled,bi+1,ci+1    ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    ci,*ciptr++[2]           ; store ci
        STH    ci+1,*ci+1ptr++[2]       ; store ci+1
[cntr]  SUB    cntr,1,cntr              ; decrement loop counter
[cntr]  B      LOOP                     ; branch to loop


5.5.3.3. Determining a New Minimum Iteration Interval 


Use the following considerations to determine the minimum iteration interval 
for the assembly instructions in Example 5-18: 

- Four memory operations (two LDWs and two STHs) must each use a .D 
  unit. With two .D units available, this loop still requires only two cycles. 

- Four instructions must use the .S units (three SHRs and one branch). With 
  two .S units available, the minimum iteration interval is still 2. 

- The two MPYs do not increase the minimum iteration interval. 

- Because the remaining four instructions (two ADDs, AND, and SUB) can 
  all use a .L unit, the minimum iteration interval for this loop is the same as 
  in Example 5-16. 


By using LDWs instead of LDHs, the program can do twice as much work in 
the same number of cycles. 
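
The resource counting above amounts to a ceiling division per unit type. The following is a minimal sketch of that arithmetic, not a description of how the assembly optimizer works. The function names are hypothetical, and the sketch ignores instructions that can go on more than one unit type (such as ADD, AND, and SUB), which must be balanced by hand.

```c
#include <assert.h>

/* Fewest cycles in which `ops` operations fit on `units` identical
 * functional units (ceiling division). */
int cycles_needed(int ops, int units)
{
    assert(ops >= 0 && units > 0);
    return (ops + units - 1) / units;
}

/* Resource-bound minimum iteration interval.
 * d_ops, s_ops, m_ops: operations that must use a .D, .S, or .M unit.
 * Two units of each type are available on every cycle. */
int resource_mii(int d_ops, int s_ops, int m_ops)
{
    const int units_per_type = 2;
    int mii = 1;
    int need;

    need = cycles_needed(d_ops, units_per_type);
    if (need > mii) mii = need;
    need = cycles_needed(s_ops, units_per_type);
    if (need > mii) mii = need;
    need = cycles_needed(m_ops, units_per_type);
    if (need > mii) mii = need;
    return mii;
}
```

For Example 5-16, resource_mii(3, 2, 1) gives 2 (three memory operations on two .D units). For Example 5-18, resource_mii(4, 4, 2) also gives 2, even though each iteration now produces two results.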
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5.5.4 Drawing a Dependency Graph 


To achieve a minimum iteration interval of 2, you must put an equal number 
of operations per unit on each side of the dependency graph. Three operations 
in one unit on a side would result in a minimum iteration interval of 3. 


Figure 5—6 shows the dependency graph divided evenly with a minimum itera- 
tion interval of 2. 


Figure 5-6. Dependency Graph of Weighted Vector Sum 
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5.5.5 Allocating Resources 


Using the dependency graph, you can allocate functional units and registers 
as shown in Example 5—19. This code is based on the following assumptions: 


- The pointers are initialized outside the loop. 
- m resides in B6, which causes both .M units to use a cross path. 
- The mask in the AND instruction resides in B10. 


Example 5-19. Linear Assembly for Weighted Vector Sum With Resources Allocated 


        LDW    .D1    *A4++,A2        ; ai & ai+1
        LDW    .D2    *B4++,B2        ; bi & bi+1
        MPY    .M1X   A2,B6,A5        ; pi = m * ai
        MPYHL  .M2X   A2,B6,B5        ; pi+1 = m * ai+1
        SHR    .S1    A5,15,A7        ; pi_scaled = (m * ai) >> 15
        SHR    .S2    B5,15,B7        ; pi+1_scaled = (m * ai+1) >> 15
        AND    .L2    B2,B10,B8       ; bi
        SHR    .S2    B2,16,B1        ; bi+1
        ADD    .L1X   A7,B8,A9        ; ci = (m * ai) >> 15 + bi
        ADD    .L2    B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    .D1    A9,*A6++[2]     ; store ci
        STH    .D2    B9,*B0++[2]     ; store ci+1
[A1]    SUB    .L1    A1,1,A1         ; decrement loop counter
[A1]    B      .S1    LOOP            ; branch to loop


5.5.6 Modulo Iteration Interval Scheduling 


Table 5-6 provides a method to keep track of resources that are a modulo 
iteration interval away from each other. In the single-cycle dot product example, 
every instruction executed every cycle and, therefore, required only one set 
of resources. Table 5-6 includes two groups of resources, which are necessary 
because you are scheduling a two-cycle loop. 

- Instructions that execute on cycle k also execute on cycle k + 2, k + 4, etc. 
  Instructions scheduled on these even cycles cannot use the same 
  resources. 

- Instructions that execute on cycle k + 1 also execute on cycle k + 3, k + 5, 
  etc. Instructions scheduled on these odd cycles cannot use the same 
  resources. 

- Because two instructions (MPY and ADD) use the 1X path but do not use 
  the same functional unit, Table 5-6 includes two rows (1X and 2X) that 
  help you keep track of the cross path resources. 
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Only seven instructions have been scheduled in this table. 


- The two LDWs use the .D units on the even cycles. 

- The MPY and MPYHL are scheduled on cycle 5 because the LDW has four 
  delay slots. The MPY instructions appear in two rows because they use 
  the .M and cross path resources on cycles 5, 7, 9, etc. 

- The two SHR instructions are scheduled two cycles after the MPY to allow 
  for the MPY's single delay slot. 

- The AND is scheduled on cycle 5, four delay slots after the LDW. 
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Table 5-6. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop) 


Unit    Even cycles (0, 2, 4, ...)      Odd cycles (1, 3, 5, ...)
.D1     LDW ai_i+1 (0)
.D2     LDW bi_i+1 (0)
.M1                                     MPY pi (5)
.M2                                     MPYHL pi+1 (5)
.L1
.L2                                     AND bi (5)
.S1                                     SHR pi_scaled (7)
.S2                                     SHR pi+1_scaled (7)
1X                                      MPY pi (5)
2X                                      MPYHL pi+1 (5)


Note: The number in parentheses is the cycle on which the instruction is first 
scheduled; the instruction then repeats every two cycles, once for each new 
iteration of the loop. 
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5.5.6.1 Resource Conflicts 


Resources from one instruction cannot conflict with resources from any other 
instruction scheduled modulo iteration intervals away. In other words, for a 
two-cycle loop, instructions scheduled on cycle n cannot use the same re- 
sources as instructions scheduled on cycles n+2,n+4,n+6, etc. Table 5-7 
shows the addition of the SHR bi+1 instruction. This must avoid a conflict of 
resources in cycles 5 and 7, which are one iteration interval away from each 
other. 


Even though LDW bi_i+1 (.D2, cycle 0) finishes on cycle 5, its child, SHR bi+1, 
cannot be scheduled on .S2 until cycle 6 because of a resource conflict with 
SHR pi+1_scaled, which is on .S2 in cycle 7. 
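
A modulo iteration interval table can be modeled as a small array indexed by unit and by cycle modulo the iteration interval. The following is a minimal sketch of that bookkeeping, written for illustration only; the type and function names are hypothetical, and cross paths are left out to keep it short.

```c
#include <assert.h>
#include <string.h>

#define MAX_II 8

enum unit { D1, D2, M1, M2, L1, L2, S1, S2, NUM_UNITS };

struct mod_table {
    int ii;                                   /* iteration interval */
    unsigned char busy[NUM_UNITS][MAX_II];    /* 1 = slot is taken  */
};

void mod_table_init(struct mod_table *t, int ii)
{
    assert(ii >= 1 && ii <= MAX_II);
    t->ii = ii;
    memset(t->busy, 0, sizeof t->busy);
}

/* Try to reserve unit u on the given cycle. Returns 1 on success, or 0
 * when an instruction a multiple of ii cycles away already holds it. */
int mod_table_reserve(struct mod_table *t, enum unit u, int cycle)
{
    int slot = cycle % t->ii;
    if (t->busy[u][slot]) {
        return 0;
    }
    t->busy[u][slot] = 1;
    return 1;
}

/* Walk-through of the SHR bi+1 case: SHR pi+1_scaled already holds .S2
 * on cycle 7, and SHR bi+1 is ready on cycle 5. Returns the first cycle
 * on which SHR bi+1 can be placed on .S2. */
int first_free_s2_cycle(void)
{
    struct mod_table t;
    int cycle = 5;

    mod_table_init(&t, 2);
    mod_table_reserve(&t, S2, 7);
    while (!mod_table_reserve(&t, S2, cycle)) {
        cycle++;
    }
    return cycle;
}
```

Cycle 5 and cycle 7 map to the same slot in a 2-cycle loop, so the walk-through settles on cycle 6.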


Figure 5-7. Dependency Graph of Weighted Vector Sum (Showing Resource Conflict) 
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Table 5—7. Modulo Iteration Interval Table for Weighted Vector Sum With SHR Instructions 


Unit    Even cycles (0, 2, 4, ...)      Odd cycles (1, 3, 5, ...)
.D1     LDW ai_i+1 (0)
.D2     LDW bi_i+1 (0)
.M1                                     MPY pi (5)
.M2                                     MPYHL pi+1 (5)
.L1
.L2                                     AND bi (5)
.S1                                     SHR pi_scaled (7)
.S2     SHR bi+1 (6)                    SHR pi+1_scaled (7)
1X                                      MPY pi (5)
2X                                      MPYHL pi+1 (5)


Note: The number in parentheses is the cycle on which the instruction is first 
scheduled; the instruction then repeats every two cycles. SHR bi+1 is the 
addition to the schedule in Table 5-6. 
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5.5.6.2 Live Too Long 


Scheduling SHR bi+1 on cycle 6 now creates a problem with scheduling the 
ADD ci instruction. The parents of ADD ci (AND bi and SHR pi_scaled) are 
scheduled on cycles 5 and 7, respectively. Because the SHR pi_scaled is 
scheduled on cycle 7, the earliest you can schedule ADD ci is cycle 8. 


However, in cycle 7, the AND bi of the next iteration writes bi, which 
creates a scheduling problem with the ADD ci instruction. If you schedule 
ADD ci on cycle 8, the ADD instruction reads the value of bi that belongs to 
the next iteration, which is incorrect. The ADD ci demonstrates a live-too-long 
problem. 


No value can be live in a register for more than the number of cycles in the loop. 
Otherwise, iteration n + 1 writes into the register before iteration n has read that 
register. Therefore, in a 2-cycle loop, if a value is written to a register at the end 
of cycle n, then all children of that value must read the register before the end 
of cycle n + 2. 
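
The rule above reduces to a single comparison. The following is a minimal sketch of that check; the function name and parameter names are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* written:   cycle at the end of which the parent writes the register
 * last_read: cycle on which the last child reads the register
 * ii:        iteration interval of the loop
 * Returns 1 when the next iteration's write would arrive before the
 * last child has read the value (a live-too-long problem), else 0. */
int live_too_long(int written, int last_read, int ii)
{
    assert(ii >= 1 && last_read > written);
    return (last_read - written) > ii;
}
```

With AND bi on cycle 5 and ADD ci on cycle 8 in a 2-cycle loop, live_too_long(5, 8, 2) returns 1. A parent on cycle 6 with the same child on cycle 8 passes the check, because the value is then read within two cycles of being written.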


5.5.6.3 Solving the Live-Too-Long Problem 
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The live-too-long problem in Table 5—7 means that the bi value would have to 
be live from cycles 6-8, or 3 cycles. No loop variable can live longer than the 
iteration interval, because a child would then read the parent value for the next 
iteration. 


To solve this problem, move AND bi to cycle 6 so that you can schedule ADD ci 
to read the correct value on cycle 8, as shown in Figure 5-8 and Table 5-8. 


Figure 5-8. Dependency Graph of Weighted Vector Sum 
(With Resource Conflict Resolved) 
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Table 5-8. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop) 


Unit    Even cycles (0, 2, 4, ...)      Odd cycles (1, 3, 5, ...)
.D1     LDW ai_i+1 (0)
.D2     LDW bi_i+1 (0)
.M1                                     MPY pi (5)
.M2                                     MPYHL pi+1 (5)
.L1     ADD ci (8)
.L2     AND bi (6)
.S1                                     SHR pi_scaled (7)
.S2     SHR bi+1 (6)                    SHR pi+1_scaled (7)
1X      ADD ci (8)                      MPY pi (5)
2X                                      MPYHL pi+1 (5)


Note: The number in parentheses is the cycle on which the instruction is first 
scheduled; the instruction then repeats every two cycles. The changes from 
Table 5-7 are AND bi (moved to cycle 6) and ADD ci (added on cycle 8). 
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5.5.6.4 Scheduling the Remaining Instructions 
Figure 5—9 shows the dependency graph with additional scheduling changes. 


The final version of the loop, with all instructions scheduled correctly, is shown 
in Table 5-9. 


Figure 5-9. Dependency Graph of Weighted Vector Sum (Scheduling ci+1) 


Table 5-9 shows the following additions: 

- B LOOP (.S1, cycle 6) 
- SUB cntr (.L1, cycle 5) 
- ADD ci+1 (.L2, cycle 10) 
- STH ci (cycle 9) 
- STH ci+1 (cycle 11) 

To avoid resource conflicts and live-too-long problems, Table 5-9 also 
includes the following additional changes: 

- LDW bi_i+1 (.D2) moved from cycle 0 to cycle 2. 
- AND bi (.L2) moved from cycle 6 to cycle 7. 
- SHR pi+1_scaled (.S2) moved from cycle 7 to cycle 9. 
- MPYHL pi+1 moved from cycle 5 to cycle 6. 
- SHR bi+1 moved from cycle 6 to cycle 8. 


From the table, you can see that this loop is pipelined six iterations deep, 
because iterations n and n + 5 execute in parallel. 
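
The pipeline depth follows from the last scheduled cycle of one iteration and the iteration interval. The following is a minimal sketch of that relationship; the function name is hypothetical, and the formula assumes that the first instruction of an iteration is scheduled on cycle 0.

```c
#include <assert.h>

/* last_cycle: cycle on which the final instruction of one iteration is
 *             scheduled (the first instruction is on cycle 0)
 * ii:         iteration interval
 * Returns the number of iterations that are in flight at once. */
int pipeline_depth(int last_cycle, int ii)
{
    assert(last_cycle >= 0 && ii >= 1);
    return last_cycle / ii + 1;
}
```

For the weighted vector sum, STH ci+1 is on cycle 11 and the iteration interval is 2, so pipeline_depth(11, 2) gives 6.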


Table 5-9. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop) 


Unit    Even cycles (0, 2, 4, ...)      Odd cycles (1, 3, 5, ...)
.D1     LDW ai_i+1 (0)                  STH ci (9)
.D2     LDW bi_i+1 (2)                  STH ci+1 (11)
.M1                                     MPY pi (5)
.M2     MPYHL pi+1 (6)
.L1     ADD ci (8)                      SUB cntr (5)
.L2     ADD ci+1 (10)                   AND bi (7)
.S1     B LOOP (6)                      SHR pi_scaled (7)
.S2     SHR bi+1 (8)                    SHR pi+1_scaled (9)
1X      ADD ci (8)                      MPY pi (5)
2X      MPYHL pi+1 (6)


Note: The number in parentheses is the cycle on which the instruction is first 
scheduled; the instruction then repeats every two cycles. 
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5.5.7 Using the Assembly Optimizer for the Weighted Vector Sum 


Example 5—20 shows the linear assembly code to perform the weighted vector 
sum. You can use this code as input to the assembly optimizer to create a soft- 
ware-pipelined loop instead of scheduling this by hand. 


Example 5-20. Linear Assembly for Weighted Vector Sum 


        .global _w_vec
_w_vec: .cproc  a, b, c, m
        .reg    ai_i1, bi_i1, pi, pi1, pi_i1, pi_s, pi1_s
        .reg    mask, bi, bi1, ci, ci1, c1, cntr
        MVK     -1,mask              ; set to all 1s to create 0xFFFFFFFF
        MVKH    0,mask               ; clear upper 16 bits to create 0xFFFF
        MVK     50,cntr              ; cntr = 100/2
        ADD     2,c,c1               ; point to c[1]
LOOP:   .trip 50
        LDW     .D1   *a++,ai_i1     ; ai & ai+1
        LDW     .D2   *b++,bi_i1     ; bi & bi+1
        MPY     .M1X  ai_i1,m,pi     ; m * ai
        MPYHL   .M2X  ai_i1,m,pi1    ; m * ai+1
        SHR     .S1   pi,15,pi_s     ; (m * ai) >> 15
        SHR     .S2   pi1,15,pi1_s   ; (m * ai+1) >> 15
        AND     .L2   bi_i1,mask,bi  ; bi
        SHR     .S2   bi_i1,16,bi1   ; bi+1
        ADD     .L1X  pi_s,bi,ci     ; ci = (m * ai) >> 15 + bi
        ADD     .L2   pi1_s,bi1,ci1  ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH     .D1   ci,*c++[2]     ; store ci
        STH     .D2   ci1,*c1++[2]   ; store ci+1
[cntr]  SUB     .L1   cntr,1,cntr    ; decrement loop counter
[cntr]  B       .S1   LOOP           ; branch to loop
        .endproc
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5.5.8 Final Assembly 


Example 5-21 shows the final assembly code for the weighted vector sum. 
The following optimizations are included: 


- While iteration n of instruction STH ci+1 is executing, iteration n + 1 of 
  STH ci is executing. To prevent the STH ci instruction from executing 
  iteration 51 while STH ci+1 executes iteration 50, execute the loop only 49 
  times and schedule the final executions of ADD ci+1 and STH ci+1 after 
  exiting the loop. 

- The mask for the AND instruction is created with MVK and MVKH in 
  parallel with the loop prolog. 

- The pointer to the odd elements in array c is also set up in parallel with the 
  loop prolog. 
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Example 5-21. Assembly Code for Weighted Vector Sum 


        LDW    .D1   *A4++,A2      ; ai & ai+1

        ADD    .L2X  A6,2,B0       ; set pointer to ci+1

        LDW    .D2   *B4++,B2      ; bi & bi+1
||      LDW    .D1   *A4++,A2      ;* ai & ai+1

        MVK    .S2   -1,B10        ; set to all 1s (0xFFFFFFFF)

        LDW    .D2   *B4++,B2      ;* bi & bi+1
||      LDW    .D1   *A4++,A2      ;** ai & ai+1
||      MVK    .S1   49,A1         ; set up loop counter
||      MVKH   .S2   0,B10         ; clr upper 16 bits (0x0000FFFF)

        MPY    .M1X  A2,B6,A5      ; m * ai
|| [A1] SUB    .L1   A1,1,A1       ; decrement loop counter

        MPYHL  .M2X  A2,B6,B5      ; m * ai+1
|| [A1] B      .S1   LOOP          ; branch to loop
||      LDW    .D2   *B4++,B2      ;** bi & bi+1
||      LDW    .D1   *A4++,A2      ;*** ai & ai+1

        SHR    .S1   A5,15,A7      ; (m * ai) >> 15
||      AND    .L2   B2,B10,B8     ; bi
||      MPY    .M1X  A2,B6,A5      ;* m * ai
|| [A1] SUB    .L1   A1,1,A1       ;* decrement loop counter

        SHR    .S2   B2,16,B1      ; bi+1
||      ADD    .L1X  A7,B8,A9      ; ci = (m * ai) >> 15 + bi
||      MPYHL  .M2X  A2,B6,B5      ;* m * ai+1
|| [A1] B      .S1   LOOP          ;* branch to loop
||      LDW    .D2   *B4++,B2      ;*** bi & bi+1
||      LDW    .D1   *A4++,A2      ;**** ai & ai+1

        SHR    .S2   B5,15,B7      ; (m * ai+1) >> 15
||      STH    .D1   A9,*A6++[2]   ; store ci
||      SHR    .S1   A5,15,A7      ;* (m * ai) >> 15
||      AND    .L2   B2,B10,B8     ;* bi
|| [A1] SUB    .L1   A1,1,A1       ;** decrement loop counter
||      MPY    .M1X  A2,B6,A5      ;** m * ai

LOOP:
        ADD    .L2   B7,B1,B9      ; ci+1 = (m * ai+1) >> 15 + bi+1
||      SHR    .S2   B2,16,B1      ;* bi+1
||      ADD    .L1X  A7,B8,A9      ;* ci = (m * ai) >> 15 + bi
||      MPYHL  .M2X  A2,B6,B5      ;** m * ai+1
|| [A1] B      .S1   LOOP          ;** branch to loop
||      LDW    .D2   *B4++,B2      ;**** bi & bi+1
||      LDW    .D1   *A4++,A2      ;***** ai & ai+1

        STH    .D2   B9,*B0++[2]   ; store ci+1
||      SHR    .S2   B5,15,B7      ;* (m * ai+1) >> 15
||      STH    .D1   A9,*A6++[2]   ;* store ci
||      SHR    .S1   A5,15,A7      ;** (m * ai) >> 15
||      AND    .L2   B2,B10,B8     ;** bi
|| [A1] SUB    .L1   A1,1,A1       ;*** decrement loop counter
||      MPY    .M1X  A2,B6,A5      ;*** m * ai
        ; Branch occurs here
        ADD    .L2   B7,B1,B9      ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    .D2   B9,*B0        ; store ci+1
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5.6 Loop Carry Paths 


Loop carry paths occur when one iteration of a loop writes a value that must 
be read by a future iteration. A loop carry path can affect the performance of 
a software-pipelined loop that executes multiple iterations in parallel. Some- 
times loop carry paths (instead of resources) determine the minimum iteration 
interval. 


IIR filter code contains a loop carry path; output samples are used as input to 
the computation of the next output sample. 


5.6.1 IIR Filter C Code 


Example 5-22 shows C code for a simple IIR filter. In this example, y[i] is an 
input to the calculation of y[i+1]. Before y[i] can be read for the next iteration, 
y[i+1] must be computed from the previous iteration. 


Example 5-22. IIR Filter C Code 


void iir(short x[],short y[],short c1, short c2, short c3)
{
    int i;

    for (i=0; i<100; i++) {
        y[i+1] = (c1*x[i] + c2*x[i+1] + c3*y[i]) >> 15;
    }
}




5.6.2 Translating C Code to Linear Assembly (Inner Loop) 


Example 5-23 shows the ’C62xx instructions that execute the inner loop of the 
IIR filter C code. In this example: 


- xptr is not postincremented after loading xi+1, because xi of the next 
  iteration is actually xi+1 of the current iteration. Thus, the pointer points to 
  the same address when loading both xi+1 for one iteration and xi for the 
  next iteration. 

- yptr is also not postincremented after storing yi+1, because yi of the next 
  iteration is yi+1 for the current iteration. 


Example 5-23. Linear Assembly for IIR Inner Loop 


        LDH   *xptr++,xi       ; xi
        MPY   c1,xi,p0         ; c1 * xi
        LDH   *xptr,xi+1       ; xi+1
        MPY   c2,xi+1,p1       ; c2 * xi+1
        ADD   p0,p1,s0         ; c1 * xi + c2 * xi+1
        LDH   *yptr++,yi       ; yi
        MPY   c3,yi,p2         ; c3 * yi
        ADD   s0,p2,s1         ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR   s1,15,yi+1       ; yi+1
        STH   yi+1,*yptr       ; store yi+1
[cntr]  SUB   cntr,1,cntr      ; decrement loop counter
[cntr]  B     LOOP             ; branch to loop
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5.6.3 Drawing a Dependency Graph 


Figure 5-10 shows the dependency graph for the IIR filter. A loop carry path 
exists from the store of yi+1 to the load of yi. The path between the STH and 
the LDH is one cycle because the load and store instructions use the same 
memory pipeline. Therefore, if a store is issued to a particular address on cycle 
n and a load from that same address is issued on the next cycle, the load reads 
the value that was written by the store instruction. 


Figure 5-10. Dependency Graph of IIR Filter 


Note: The shaded numbers show the loop carry path: 5 + 2 + 1 + 1 + 1 = 10. 
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5.6.4 Determining the Minimum Iteration Interval 


To determine the minimum iteration interval, you must consider both resources 
and data dependency constraints. Based on resources in Table 5—10, the 
minimum iteration interval is 2. 


Note: 


There are six non-.M units available: three on the A side (.S1, .D1, .L1) and 
three on the B side (.S2, .D2, .L2). Therefore, to determine resource 
constraints, divide the total number of non-.M units used on each side by 3 
(3 is the total number of non-.M units available on each side). 


Based on non-.M unit resources in Table 5-10, the minimum iteration 
interval for the IIR filter is 2 because the total non-.M units on the A side is 5 
(5 / 3 is greater than 1, so you round up to the next whole number). The B side 
uses only three non-.M units, so this does not affect the minimum iteration 
interval, and no other unit is used more than twice. 


Table 5-10. Resource Table for IIR Filter 


(a) A side

Unit(s)               Instructions    Total/Unit
.M1                   2 MPYs          2
.S1                   B               1
.D1                   2 LDHs          2
.L1, .S1, or .D1      ADD & SUB       2
Total non-.M units                    5


(b) B side

Unit(s)               Instructions    Total/Unit
.M2                   MPY             1
.S2                   SHR             1
.D2                   STH             1
.L2, .S2, or .D2      ADD             1
Total non-.M units                    3


However, the IIR has a data dependency constraint defined by its loop carry 
path. Figure 5-10 shows that if you schedule LDH yi on cycle 0: 


- The earliest you can schedule MPY p2 is on cycle 5. 

- The earliest you can schedule ADD s1 is on cycle 7. 

- SHR yi+1 must be on cycle 8 and STH on cycle 9. 

- Because the LDH must wait for the STH to be issued, the earliest the 
  second iteration can begin is cycle 10. 


To determine the minimum loop carry path, add all of the numbers along the 
loop paths in the dependency graph. This means that this loop carry path is 
10 (5 + 2 + 1 + 1 + 1). 
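
The loop carry calculation can be written out as a short sum followed by a maximum. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and the latency list is the one read from the dependency graph (LDH, MPY, ADD, SHR, STH).

```c
#include <assert.h>

/* Sum of the latencies along one loop carry path. */
int loop_carry_path(const int latency[], int count)
{
    int i;
    int total = 0;

    assert(count > 0);
    for (i = 0; i < count; i++) {
        total += latency[i];
    }
    return total;
}

/* The minimum iteration interval is the greater of the resource bound
 * and the loop carry path. */
int min_iteration_interval(int resource_bound, int carry_path)
{
    return (resource_bound > carry_path) ? resource_bound : carry_path;
}

/* IIR filter with the load of yi inside the loop: resource bound of 2
 * and a carry path of 5 + 2 + 1 + 1 + 1. */
int iir_min_iteration_interval(void)
{
    static const int latency[] = { 5, 2, 1, 1, 1 };
    return min_iteration_interval(2, loop_carry_path(latency, 5));
}
```

Here the loop carry path, not the resource count, sets the iteration interval, so shortening that path is the only way to speed up the loop.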
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Although the minimum iteration interval is the greater of the resource limits and 
data dependency constraints, an interval of 10 seems slow. Figure 5—11 
shows how to improve the performance. 


5.6.4.1 Drawing a New Dependency Graph 


Figure 5-11 shows a new graph with a loop carry path of 4 (2 + 1 + 1). Because 
the MPY p2 instruction can read yi+1 while it is still in a register, you can reduce 
the loop carry path by six cycles. LDH yi is no longer in the graph. Instead, you 
can issue LDH y[0] once outside the loop. In every iteration after that, the yi+1 
values written by the SHR instruction are valid y inputs to the MPY instruction. 
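
The same change can be expressed in C by keeping the previous output in a local variable instead of reading it back from the array. The following is a minimal sketch, not part of the original examples, that compares the two forms on a host compiler. The function names, the test data, and the `const` qualifiers are assumptions made for illustration.

```c
#include <assert.h>

#define IIR_N 100   /* number of output samples computed */

/* y[i] is read back from memory on every iteration. */
void iir_reload(const short x[], short y[], short c1, short c2, short c3)
{
    int i;
    for (i = 0; i < IIR_N; i++) {
        y[i + 1] = (short)((c1 * x[i] + c2 * x[i + 1] + c3 * y[i]) >> 15);
    }
}

/* y[0] is loaded once; after that the previous output stays in yi. */
void iir_carried(const short x[], short y[], short c1, short c2, short c3)
{
    int i;
    short yi = y[0];
    for (i = 0; i < IIR_N; i++) {
        yi = (short)((c1 * x[i] + c2 * x[i + 1] + c3 * yi) >> 15);
        y[i + 1] = yi;
    }
}

/* Hypothetical helper: returns 1 when both forms produce the same output. */
int iir_forms_agree(void)
{
    short x[IIR_N + 1], ya[IIR_N + 1], yb[IIR_N + 1];
    int i;

    for (i = 0; i <= IIR_N; i++) {
        x[i] = (short)((i * 53) % 2001 - 1000);
        ya[i] = 0;
        yb[i] = 0;
    }
    ya[0] = 100;
    yb[0] = 100;
    assert(ya[0] == yb[0]);   /* both forms start from the same y[0] */
    iir_reload(x, ya, 8192, 8192, 8192);
    iir_carried(x, yb, 8192, 8192, 8192);
    for (i = 0; i <= IIR_N; i++) {
        if (ya[i] != yb[i]) {
            return 0;
        }
    }
    return 1;
}
```

Removing the load from the loop shortens the chain from one output to the next, which is what reduces the loop carry path.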


Figure 5-11. Dependency Graph of IIR Filter (With Smaller Loop Carry) 


Note: The shaded numbers show the loop carry path: 2 + 1 + 1 = 4. 
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Loop Carry Paths 


Example 5—24 shows the new linear assembly from the graph in Figure 5—11, 
where LDH yi was removed. The one variable y that is read and written is yi 
for the MPY p2 instruction and yi+1 for the SHR and STH instructions. 


Example 5-24. Linear Assembly for IIR Inner Loop With Reduced Loop Carry Path 


        LDH   *xptr++,xi       ; xi
        MPY   c1,xi,p0         ; c1 * xi
        LDH   *xptr,xi+1       ; xi+1
        MPY   c2,xi+1,p1       ; c2 * xi+1
        ADD   p0,p1,s0         ; c1 * xi + c2 * xi+1
        MPY   c3,y,p2          ; c3 * yi
        ADD   s0,p2,s1         ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR   s1,15,y          ; yi+1
        STH   y,*yptr++        ; store yi+1
[cntr]  SUB   cntr,1,cntr      ; decrement loop counter
[cntr]  B     LOOP             ; branch to loop


5.6.5 Allocating Resources 


Example 5-25 shows the same linear assembly instructions as those in 
Example 5—24 with the functional units and registers assigned. 


Example 5-25. Linear Assembly for IIR Inner Loop (With Allocated Resources) 


        LDH   .D1   *A4++,A2     ; xi
        MPY   .M1   A6,A2,A5     ; c1 * xi
        LDH   .D1   *A4,A3       ; xi+1
        MPY   .M1X  B6,A3,A7     ; c2 * xi+1
        ADD   .L1   A5,A7,A9     ; c1 * xi + c2 * xi+1
        MPY   .M2X  A8,B2,B3     ; c3 * yi
        ADD   .L2X  B3,A9,B5     ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR   .S2   B5,15,B2     ; yi+1
        STH   .D2   B2,*B4++     ; store yi+1
[A1]    SUB   .L1   A1,1,A1      ; decrement loop counter
[A1]    B     .S1   LOOP         ; branch to loop
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5.6.6 Modulo Iteration Interval Scheduling 


Table 5-11 shows the modulo iteration interval table for the IIR filter. The SHR 
instruction on cycle 10 finishes in time for the MPY p2 instruction from the next 
iteration to read its result on cycle 11. 


Table 5-11. Modulo Iteration Interval Table for IIR (4-Cycle Loop) 
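Unit    Cycle 0, 4, 8, ...    Cycle 1, 5, 9, ...    Cycle 2, 6, 10, ...    Cycle 3, 7, 11, ...
.D1     LDH xi (0)            LDH xi+1 (1)
.D2                                                                        STH yi+1 (11)
.M1                           MPY p0 (5)            MPY p1 (6)
.M2                                                                        MPY p2 (7)
.L1     ADD s0 (8)            SUB cntr (5)
.L2                           ADD s1 (9)
.S1                                                 B LOOP (6)
.S2                                                 SHR yi+1 (10)
1X                                                  MPY p1 (6)
2X                            ADD s1 (9)                                   MPY p2 (7)

Note: The number in parentheses is the cycle on which the instruction is first 
scheduled; the instruction then repeats every four cycles. 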
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5.6.7 Using the Assembly Optimizer for the IIR Filter 


Example 5—26 shows the linear assembly code to perform the IIR filter. Once 
again, you can use this code as input to the assembly optimizer to create a soft- 
ware-pipelined loop instead of scheduling this by hand. 


Example 5-26. Linear Assembly for IIR Filter 


        .global _iir
_iir:   .cproc  x, y, c1, c2, c3
        .reg    xi, xi1, yi1
        .reg    p0, p1, p2, s0, s1, cntr
        MVK     100,cntr             ; cntr = 100
        LDH     .D2   *y++,yi1       ; yi+1
LOOP:   .trip 100
        LDH     .D1   *x++,xi        ; xi
        MPY     .M1   c1,xi,p0       ; c1 * xi
        LDH     .D1   *x,xi1         ; xi+1
        MPY     .M1X  c2,xi1,p1      ; c2 * xi+1
        ADD     .L1   p0,p1,s0       ; c1 * xi + c2 * xi+1
        MPY     .M2X  c3,yi1,p2      ; c3 * yi
        ADD     .L2X  s0,p2,s1       ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     .S2   s1,15,yi1      ; yi+1
        STH     .D2   yi1,*y++       ; store yi+1
[cntr]  SUB     .L1   cntr,1,cntr    ; decrement loop counter
[cntr]  B       .S1   LOOP           ; branch to loop
        .endproc


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5.6.8 Final Assembly 


Example 5-27 shows the final assembly for the IIR filter. With one load of y[0] 
outside the loop, no other loads from the y array are needed. Example 5-27 
requires 408 cycles: (4 x 100) + 8. 


Example 5-27. Assembly Code for IIR Filter 


        LDH   .D1   *A4++,A2     ; xi

        LDH   .D1   *A4,A3       ; xi+1

        LDH   .D2   *B4++,B2     ; load y[0] outside of loop

        MVK   .S1   100,A1       ; set up loop counter

        LDH   .D1   *A4++,A2     ;* xi

   [A1] SUB   .L1   A1,1,A1      ; decrement loop counter
||      MPY   .M1   A6,A2,A5     ; c1 * xi
||      LDH   .D1   *A4,A3       ;* xi+1

        MPY   .M1X  B6,A3,A7     ; c2 * xi+1
|| [A1] B     .S1   LOOP         ; branch to loop

        MPY   .M2X  A8,B2,B3     ; c3 * yi

LOOP:
        ADD   .L1   A5,A7,A9     ; c1 * xi + c2 * xi+1
||      LDH   .D1   *A4++,A2     ;** xi

        ADD   .L2X  B3,A9,B5     ; c1 * xi + c2 * xi+1 + c3 * yi
|| [A1] SUB   .L1   A1,1,A1      ;* decrement loop counter
||      MPY   .M1   A6,A2,A5     ;* c1 * xi
||      LDH   .D1   *A4,A3       ;** xi+1

        SHR   .S2   B5,15,B2     ; yi+1
||      MPY   .M1X  B6,A3,A7     ;* c2 * xi+1
|| [A1] B     .S1   LOOP         ;* branch to loop

        STH   .D2   B2,*B4++     ; store yi+1
||      MPY   .M2X  A8,B2,B3     ;* c3 * yi
        ; Branch occurs here
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5.7 lf-Then-Else Statements in a Loop 


If-then-else statements in C cause certain instructions to execute when the if 
condition is true and other instructions to execute when it is false. One way to 
accomplish this in linear assembly code is with conditional instructions. 
Because all ’C62xx instructions can be conditional on one of five general-purpose 
registers, conditional instructions can handle both the true and false cases of 
the if-then-else C statement. 


5.7.1 If-Then-Else C Code 


Example 5—28 contains a loop with an if-then-else statement. You either add 
a[i] to sum or subtract a[i] from sum. 


Example 5-28. If-Then-Else C Code 


int if_then(short a[], int codeword, int mask, short theta)
{
    int i,sum, cond;

    sum = 0;

    for (i = 0; i < 32; i++){
        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask = mask << 1;
    }
    return (sum);
}

Branching is one way to execute the if-then-else statement: branch to the ADD 
when the if statement is true and branch to the SUB when the if statement is 
false. However, because each branch has five delay slots, this method 
requires additional cycles. Furthermore, branching within the loop makes soft- 
ware pipelining almost impossible. 


Using conditional instructions, on the other hand, eliminates the need to 
branch to the appropriate piece of code after checking whether the condition 
is true or false. Simply program both the ADD and SUB as usual, but make 
them conditional on the zero and nonzero values of a condition register. This 
method also allows you to software pipeline the loop and achieve much better 
performance than you would with branching. 
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5.7.2 Translating C Code to Linear Assembly 


Example 5-29 shows the linear assembly instructions needed to execute the 
inner loop of the C code in Example 5-28. 


Example 5-29. Linear Assembly for If-Then-Else Inner Loop 


        AND     codeword,mask,cond     ; cond = codeword & mask
[cond]  MVK     1,cond                 ; !(!(cond))
        CMPEQ   theta,cond,if          ; (theta == !(!(cond)))
        LDH     *aptr++,ai             ; a[i]
[if]    ADD     sum,ai,sum             ; sum += a[i]
[!if]   SUB     sum,ai,sum             ; sum -= a[i]
        SHL     mask,1,mask            ; mask = mask << 1;
[cntr]  ADD     -1,cntr,cntr           ; decrement counter
[cntr]  B       LOOP                   ; for LOOP


CMPEQ is used to create IF. The ADD is conditional when IF is nonzero (corre- 
sponds to then); the SUB is conditional when IF is 0 (corresponds to else). 


A conditional MVK performs the !(!(cond)) C statement. If the result of the 
bitwise AND is nonzero, a 1 is written into cond; if the result of the AND is 0, 
cond remains at 0. 
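
The effect of the conditional instructions can be modeled in C by computing both results and selecting one of them. The following is a minimal sketch, not part of the original examples. The function names are hypothetical, and codeword and mask are declared unsigned here (an assumption) so that the left shift is well defined in portable C.

```c
#include <assert.h>

#define IF_THEN_N 32   /* loop count, one pass per mask bit */

/* Branching form: one of the two statements runs on each pass. */
int if_then_branching(const short a[], unsigned int codeword,
                      unsigned int mask, short theta)
{
    int i;
    int sum = 0;

    assert(a != 0);
    for (i = 0; i < IF_THEN_N; i++) {
        unsigned int cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask <<= 1;
    }
    return sum;
}

/* Predicated form: mirrors the AND, conditional MVK, CMPEQ, and the
 * ADD/SUB pair that are conditional on the same register. */
int if_then_predicated(const short a[], unsigned int codeword,
                       unsigned int mask, short theta)
{
    int i;
    int sum = 0;

    assert(a != 0);
    for (i = 0; i < IF_THEN_N; i++) {
        unsigned int cond = codeword & mask;    /* AND                */
        int pred;
        int add_result;
        int sub_result;

        if (cond) cond = 1;                     /* [cond] MVK 1,cond  */
        pred = (theta == (int)cond);            /* CMPEQ              */
        add_result = sum + a[i];                /* [if]  ADD          */
        sub_result = sum - a[i];                /* [!if] SUB          */
        sum = pred ? add_result : sub_result;   /* only one commits   */
        mask <<= 1;                             /* SHL                */
    }
    return sum;
}
```

Because the predicated form has no branch inside the loop body, every pass executes the same instruction sequence, which is the property that software pipelining relies on.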
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5.7.3. Drawing a Dependency Graph 


Figure 5-12 shows the dependency graph for the if-then-else C code. This 
graph illustrates the following arrangement: 


- Two nodes on the graph contain sum: one for the ADD and one for the 
  SUB. Because some iterations are performing an ADD and others are 
  performing a SUB, each of these nodes is a possible input to the next 
  iteration of either node. 

- The LDH ai instruction is a parent of both ADD sum and SUB sum, 
  because both instructions read ai. 

- CMPEQ if is also a parent to ADD sum and SUB sum, because both read 
  IF for the conditional execution. 

- The result of SHL mask is read on the next iteration by the AND cond 
  instruction. 


Figure 5-12. Dependency Graph of If-Then-Else Code 
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5.7.4 Determining the Minimum Iteration Interval 


With nine instructions, the minimum iteration interval is at least 2, because a 
maximum of eight instructions can be in parallel. Based on the way the depen- 
dency graph in Figure 5—12 is split, five instructions are on the A side and four 
are on the B side. Because none of the instructions are MPYs, all instructions 
must go on the .S, .D, or .L units, which means you have a total of six 
resources. 


- LDH must be on a .D unit. 
- SHL, B, and MVK must be on a .S unit. 
- The ADDs and SUB can be on the .S, .L, or .D units. 
- The AND can be on a .S or .L unit. 


From Table 5-12, you can see that no one resource is used more than two 
times, so the minimum iteration interval is still 2. 


Table 5-12. Resource Table for If-Then-Else Code

(a) A side

Unit(s)              Instructions   Total/Unit
.M1                                 0
.S1                  SHL and B      2
.D1                  LDH            1
.L1, .S1, or .D1     ADD and SUB    2
Total non-.M units                  5

(b) B side

Unit(s)              Instructions   Total/Unit
.M2                                 0
.S2                  MVK            1
.L2                  CMPEQ          1
.L2 or .S2           AND            1
.L2, .S2, or .D2     ADD            1
Total non-.M units                  4


The minimum iteration interval is also affected by the total number of
instructions. Because three units can perform nonmultiply operations on a given
side, up to six nonmultiply instructions per side can be performed with a
minimum iteration interval of 2 (3 units x 2 cycles). Because only five
instructions are on the A side and four are on the B side, the minimum iteration
interval is still 2.
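The two resource checks described above can be written down as a short calculation. The following C sketch is illustrative only and is not part of the TI tools: the function names are hypothetical, and it assumes three non-.M units (.L, .S, and .D) per side, as stated in the text.

```c
/* Hypothetical helper: resource-based lower bound on the iteration
 * interval for one side of the datapath.
 *   max_per_unit : most instructions that must share one specific unit
 *   non_m_total  : total non-.M instructions assigned to that side
 * Assumes three non-.M units (.L, .S, .D) per side. */
int min_ii_resources(int max_per_unit, int non_m_total)
{
    const int units_per_side = 3;
    /* ceil(non_m_total / units_per_side) */
    int by_total = (non_m_total + units_per_side - 1) / units_per_side;
    return max_per_unit > by_total ? max_per_unit : by_total;
}

/* The bound for the whole loop is the larger of the two sides. */
int min_ii_loop(int a_max, int a_total, int b_max, int b_total)
{
    int a = min_ii_resources(a_max, a_total);
    int b = min_ii_resources(b_max, b_total);
    return a > b ? a : b;
}
```

With the counts from Table 5-12 (A side: at most 2 instructions on one unit, 5 in total; B side: at most 1 on one unit, 4 in total), min_ii_loop(2, 5, 1, 4) evaluates to 2. This is only a resource bound; data dependencies can still force a larger interval.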
5.7.5 Allocating Resources


Now that the graph is split and you know the minimum iteration interval, you 
can allocate functional units and registers to the instructions. You must ensure 
that no resource is used more than twice. 


Example 5-30 shows the linear assembly with the functional units and regis- 
ters that are used in the inner loop. 


Example 5-30. Linear Assembly for Full If-Then-Else Code

          .global _if_then
_if_then: .cproc  a, cword, mask, theta
          .reg    cond, if, ai, sum, cntr

          MVK     32,cntr                   ; cntr = 32
          ZERO    sum                       ; sum = 0

LOOP:     .trip 32
          AND     .S2X    cword,mask,cond   ; cond = codeword & mask
   [cond] MVK     .S2     1,cond            ; !(!(cond))
          CMPEQ   .L2     theta,cond,if     ; (theta == !(!(cond)))
          LDH     .D1     *a++,ai           ; a[i]
   [if]   ADD     .L1     sum,ai,sum        ; sum += a[i]
   [!if]  SUB     .D1     sum,ai,sum        ; sum -= a[i]
          SHL     .S1     mask,1,mask       ; mask = mask << 1;
   [cntr] ADD     .L2     -1,cntr,cntr      ; decrement counter
   [cntr] B       .S1     LOOP              ; for LOOP

          .return sum
          .endproc

5.7.6 Final Assembly


Example 5-31 shows the final assembly code after software pipelining. The 
performance of this loop is 70 cycles (2 x 32 +6). 


Example 5-31. Assembly Code for If-Then-Else

          MVK     .S2     32,B0           ; set up loop counter

   [B0]   ADD     .L2     -1,B0,B0        ; decrement counter

   [B0]   ADD     .L2     -1,B0,B0        ; decrement counter
|| [B0]   B       .S1     LOOP            ; for LOOP
||        LDH     .D1     *A4++,A5        ; a[i]

          SHL     .S1     A6,1,A6         ; mask = mask << 1;
||        AND     .S2X    B4,A6,B2        ; cond = codeword & mask

   [B2]   MVK     .S2     1,B2            ; !(!(cond))
|| [B0]   ADD     .L2     -1,B0,B0        ; decrement counter
|| [B0]   B       .S1     LOOP            ;* for LOOP
||        LDH     .D1     *A4++,A5        ;* a[i]

          CMPEQ   .L2     B6,B2,B1        ; (theta == !(!(cond)))
||        SHL     .S1     A6,1,A6         ;* mask = mask << 1;
||        AND     .S2X    B4,A6,B2        ;* cond = codeword & mask
||        ZERO    .L1     A7              ; zero out accumulator

LOOP:
   [B0]   ADD     .L2     -1,B0,B0        ; decrement counter
|| [B2]   MVK     .S2     1,B2            ;* !(!(cond))
|| [B0]   B       .S1     LOOP            ;** for LOOP
||        LDH     .D1     *A4++,A5        ;** a[i]

   [B1]   ADD     .L1     A7,A5,A7        ; sum += a[i]
|| [!B1]  SUB     .D1     A7,A5,A7        ; sum -= a[i]
||        CMPEQ   .L2     B6,B2,B1        ;* (theta == !(!(cond)))
||        SHL     .S1     A6,1,A6         ;** mask = mask << 1;
||        AND     .S2X    B4,A6,B2        ;** cond = codeword & mask
          ; Branch occurs here




5.7.7 Comparing Performance 


You can improve the performance of the code in Example 5-31 if you know
that the loop count is at least 3. If the loop count is at least 3, remove the
decrement counter instructions outside the loop and put the MVK (for setting up
the loop counter) in parallel with the first branch. These two changes save two
cycles at the beginning of the loop prolog.

The first two branches are now unconditional, because the loop count is at
least 3 and you know that the first two branches must execute. To account for
the removal of the three decrement-loop-counter instructions, set the loop
counter to 3 fewer than the actual number of times you want the loop to
execute: in this case, 29 (32 - 3).


Example 5-32. Assembly Code for If-Then-Else With Loop Count Greater Than 3

          B       .S1     LOOP            ; for LOOP
||        LDH     .D1     *A4++,A5        ; a[i]
||        MVK     .S2     29,B0           ; set up loop counter

          SHL     .S1     A6,1,A6         ; mask = mask << 1;
||        AND     .S2X    B4,A6,B2        ; cond = codeword & mask

   [B2]   MVK     .S2     1,B2            ; !(!(cond))
||        B       .S1     LOOP            ;* for LOOP
||        LDH     .D1     *A4++,A5        ;* a[i]

          CMPEQ   .L2     B6,B2,B1        ; (theta == !(!(cond)))
||        SHL     .S1     A6,1,A6         ;* mask = mask << 1;
||        AND     .S2X    B4,A6,B2        ;* cond = codeword & mask
||        ZERO    .L1     A7              ; zero out accumulator

LOOP:
   [B0]   ADD     .L2     -1,B0,B0        ; decrement counter
|| [B2]   MVK     .S2     1,B2            ;* !(!(cond))
|| [B0]   B       .S1     LOOP            ;** for LOOP
||        LDH     .D1     *A4++,A5        ;** a[i]

   [B1]   ADD     .L1     A7,A5,A7        ; sum += a[i]
|| [!B1]  SUB     .D1     A7,A5,A7        ; sum -= a[i]
||        CMPEQ   .L2     B6,B2,B1        ;* (theta == !(!(cond)))
||        SHL     .S1     A6,1,A6         ;** mask = mask << 1;
||        AND     .S2X    B4,A6,B2        ;** cond = codeword & mask
          ; Branch occurs here


Example 5-32 shows the improved loop with a cycle count of 68 (2 x 32 + 4).
Table 5-13 compares the performance of Example 5-31 and Example 5-32.

Table 5-13. Comparison of If-Then-Else Code Examples

Code Example                                                              Cycles         Cycle Count
Example 5-31  If-then-else assembly code                                  (2 x 32) + 6   70
Example 5-32  If-then-else assembly code with loop count greater than 3   (2 x 32) + 4   68

5.8 Loop Unrolling


Even though the performance of the previous example is good, it can be
improved. When resources are not fully used, you can improve performance by
unrolling the loop. In Example 5-32, only nine instructions execute every two
cycles. If you unroll the loop and analyze the new minimum iteration interval,
you have room to add instructions. A minimum iteration interval of 3 provides
a 25% improvement in throughput: three cycles to do two iterations, rather
than the four cycles required in Example 5-32.

5.8.1 Unrolled If-Then-Else C Code


Example 5-33 shows the unrolled version of the if-then-else C code in 
Example 5-28 on page 5-59. 


Example 5-33. If-Then-Else C Code (Unrolled)

int unrolled_if_then(short a[], int codeword, int mask, short theta)
{
    int i, sum, cond;

    sum = 0;
    for (i = 0; i < 32; i += 2) {
        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask = mask << 1;

        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i+1];
        else
            sum -= a[i+1];
        mask = mask << 1;
    }
    return (sum);
}
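One way to gain confidence in an unrolled loop is to run it against the rolled form on the same input. The sketch below is not from the original examples: the rolled version is an assumed form (the same body executed once per iteration), all function names are hypothetical, and unsigned types are used for codeword and mask so that the final left shift is well defined in portable C.

```c
/* Rolled loop: one if-then-else per iteration (assumed form). */
int rolled_if_then(const short a[], unsigned codeword, unsigned mask,
                   short theta)
{
    int i, sum = 0;
    for (i = 0; i < 32; i++) {
        unsigned cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask <<= 1;
    }
    return sum;
}

/* Unrolled by two: the same body repeated for a[i] and a[i+1]. */
int unrolled_if_then_u(const short a[], unsigned codeword, unsigned mask,
                       short theta)
{
    int i, sum = 0;
    for (i = 0; i < 32; i += 2) {
        unsigned cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask <<= 1;

        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i + 1];
        else
            sum -= a[i + 1];
        mask <<= 1;
    }
    return sum;
}

/* Runs one version on the test pattern a[i] = i + 1 with mask = 1. */
int demo_sum(unsigned codeword, short theta, int use_unrolled)
{
    short a[32];
    int i;
    for (i = 0; i < 32; i++)
        a[i] = (short)(i + 1);
    return use_unrolled ? unrolled_if_then_u(a, codeword, 1u, theta)
                        : rolled_if_then(a, codeword, 1u, theta);
}
```

For any codeword and theta, demo_sum(codeword, theta, 0) and demo_sum(codeword, theta, 1) should return the same value, because unrolling changes only how the iterations are grouped, not what they compute.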
5.8.2 Translating C Code to Linear Assembly


Example 5-34 shows the unrolled inner loop with 16 instructions and the 
possibility of achieving a loop with a minimum iteration interval of 3. 


Example 5-34. Linear Assembly for Unrolled If-Then-Else Inner Loop

            AND     codeword,maski,condi      ; condi = codeword & maski
 [condi]    MVK     1,condi                   ; !(!(condi))
            CMPEQ   theta,condi,ifi           ; (theta == !(!(condi)))
            LDH     *aptr++,ai                ; a[i]
 [ifi]      ADD     sumi,ai,sumi              ; sum += a[i]
 [!ifi]     SUB     sumi,ai,sumi              ; sum -= a[i]
            SHL     maski,1,maski+1           ; maski+1 = maski << 1;

            AND     codeword,maski+1,condi+1  ; condi+1 = codeword & maski+1
 [condi+1]  MVK     1,condi+1                 ; !(!(condi+1))
            CMPEQ   theta,condi+1,ifi+1       ; (theta == !(!(condi+1)))
            LDH     *aptr++,ai+1              ; a[i+1]
 [ifi+1]    ADD     sumi+1,ai+1,sumi+1        ; sum += a[i+1]
 [!ifi+1]   SUB     sumi+1,ai+1,sumi+1        ; sum -= a[i+1]
            SHL     maski+1,1,maski           ; maski = maski+1 << 1;

 [cntr]     ADD     -1,cntr,cntr              ; decrement counter
 [cntr]     B       LOOP                      ; for LOOP

5.8.3 Drawing a Dependency Graph


Although there are numerous ways to split the dependency graph, the main 
goal is to achieve a minimum iteration interval of 3 and meet these conditions: 


- You cannot have more than nine non-.M instructions on either side.
- Only three non-.M instructions can execute per cycle.


Figure 5-13 shows the dependency graph for the unrolled if-then-else code. 
Nine instructions are on the A side, and seven instructions are on the B side. 


Figure 5-13. Dependency Graph of If-Then-Else Code (Unrolled)

[Figure not reproduced: the graph is split into an A side and a B side.]




5.8.4 Determining the Minimum Iteration Interval 


With 16 instructions, the minimum iteration interval is at least 3, because a
maximum of six instructions can execute in parallel on the non-.M units, with
the following allocation possibilities:

- LDH must be on a .D unit.
- SHL, B, and MVK must be on a .S unit.
- The ADDs and SUB can be on a .S, .L, or .D unit.
- The AND can be on a .S or .L unit.

From Table 5-14, you can see that no one resource is used more than three
times, so the minimum iteration interval is still 3.

Checking the total number of non-.M instructions on each side shows that a
total of nine instructions per side can be performed with the minimum iteration
interval of 3. Because nine non-.M instructions are on the A side and only seven
are on the B side, the minimum iteration interval is still 3.


Table 5-14. Resource Table for Unrolled If-Then-Else Code

(a) A side

Unit(s)              Instructions      Total/Unit
.M1                                    0
.S1                  MVK and 2 SHLs    3
.D1                  2 LDHs            2
.L1                  CMPEQ             1
.L1 or .S1           AND               1
.L1, .S1, or .D1     ADD and SUB       2
Total non-.M units                     9

(b) B side

Unit(s)              Instructions      Total/Unit
.M2                                    0
.S2                  MVK and B         2
.L2                  CMPEQ             1
.L2 or .S2           AND               1
.L2, .S2, or .D2     SUB and 2 ADDs    3
Total non-.M units                     7


5.8.5 Allocating Resources

Now that the graph is split and you know the minimum iteration interval, you
can allocate functional units and registers to the instructions. You must ensure
no resource is used more than three times.

Example 5-35 shows the linear assembly code with the functional units and
registers.

Example 5-35. Linear Assembly for Full Unrolled If-Then-Else Code

          .global _unrolled_if_then
_unrolled_if_then: .cproc a, cword, mask, theta
          .reg    cword, mask, theta, ifi, ifi1, a, ai, ai1, cntr
          .reg    cdi, cdi1, sumi, sumi1, sum

          MV      A4,a                      ; C callable register for 1st operand
          MV      B4,cword                  ; C callable register for 2nd operand
          MV      A6,mask                   ; C callable register for 3rd operand
          MV      B6,theta                  ; C callable register for 4th operand
          MVK     16,cntr                   ; cntr = 32/2
          ZERO    sumi                      ; sumi = 0
          ZERO    sumi1                     ; sumi+1 = 0

LOOP:     .trip 32
          AND     .L1X    cword,mask,cdi    ; cdi = codeword & maski
   [cdi]  MVK     .S1     1,cdi             ; !(!(cdi))
          CMPEQ   .L1X    theta,cdi,ifi     ; (theta == !(!(cdi)))
          LDH     .D1     *a++,ai           ; a[i]
   [ifi]  ADD     .L1     sumi,ai,sumi      ; sum += a[i]
   [!ifi] SUB     .D1     sumi,ai,sumi      ; sum -= a[i]
          SHL     .S1     mask,1,mask       ; maski+1 = maski << 1;

          AND     .L2X    cword,mask,cdi1   ; cdi+1 = codeword & maski+1
   [cdi1] MVK     .S2     1,cdi1            ; !(!(cdi+1))
          CMPEQ   .L2     theta,cdi1,ifi1   ; (theta == !(!(cdi+1)))
          LDH     .D1     *a++,ai1          ; a[i+1]
   [ifi1] ADD     .L2     sumi1,ai1,sumi1   ; sum += a[i+1]
   [!ifi1] SUB    .D2     sumi1,ai1,sumi1   ; sum -= a[i+1]
          SHL     .S1     mask,1,mask       ; maski = maski+1 << 1;

   [cntr] ADD     .D2     -1,cntr,cntr      ; decrement counter
   [cntr] B       .S2     LOOP              ; for LOOP

          ADD     sumi,sumi1,sum            ; Add sumi and sumi+1 for ret value

          .return sum
          .endproc

5.8.6 Final Assembly


Example 5-36 shows the final assembly code after software pipelining. The 
cycle count of this loop is now 53: (3 x 16) + 5. 


Example 5-36. Assembly Code for Unrolled If-Then-Else

          MVK     .S2     16,B0           ; set up loop counter

          LDH     .D1     *A4++,A5        ; a[i]
|| [B0]   ADD     .D2     -1,B0,B0        ; decrement counter

          LDH     .D1     *A4++,B5        ; a[i+1]
|| [B0]   B       .S2     LOOP            ; for LOOP
|| [B0]   ADD     .D2     -1,B0,B0        ; decrement counter
||        SHL     .S1     A6,1,A6         ; maski+1 = maski << 1;
||        AND     .L1X    B4,A6,A2        ; condi = codeword & maski

   [A2]   MVK     .S1     1,A2            ; !(!(condi))
||        AND     .L2X    B4,A6,B2        ; condi+1 = codeword & maski+1
||        ZERO    .L1     A7              ; zero accumulator

   [B2]   MVK     .S2     1,B2            ; !(!(condi+1))
||        CMPEQ   .L1X    B6,A2,A1        ; (theta == !(!(condi)))
||        SHL     .S1     A6,1,A6         ; maski = maski+1 << 1;
||        LDH     .D1     *A4++,A5        ;* a[i]
||        ZERO    .L2     B7              ; zero accumulator

LOOP:
          CMPEQ   .L2     B6,B2,B1        ; (theta == !(!(condi+1)))
|| [B0]   ADD     .D2     -1,B0,B0        ; decrement counter
||        LDH     .D1     *A4++,B5        ;* a[i+1]
|| [B0]   B       .S2     LOOP            ;* for LOOP
||        SHL     .S1     A6,1,A6         ;* maski+1 = maski << 1;
||        AND     .L1X    B4,A6,A2        ;* condi = codeword & maski

   [A1]   ADD     .L1     A7,A5,A7        ; sum += a[i]
|| [!A1]  SUB     .D1     A7,A5,A7        ; sum -= a[i]
|| [A2]   MVK     .S1     1,A2            ;* !(!(condi))
||        AND     .L2X    B4,A6,B2        ;* condi+1 = codeword & maski+1

   [B1]   ADD     .L2     B7,B5,B7        ; sum += a[i+1]
|| [!B1]  SUB     .D2     B7,B5,B7        ; sum -= a[i+1]
|| [B2]   MVK     .S2     1,B2            ;* !(!(condi+1))
||        CMPEQ   .L1X    B6,A2,A1        ;* (theta == !(!(condi)))
||        SHL     .S1     A6,1,A6         ;* maski = maski+1 << 1;
||        LDH     .D1     *A4++,A5        ;** a[i]
          ; Branch occurs here

          ADD     .L1X    A7,B7,A4        ; Move to return register

5.8.7 Comparing Performance


Table 5-15 compares the performance of all versions of the if-then-else code 
examples. 


Table 5-15. Comparison of If-Then-Else Code Examples

Code Example                                                              Cycles         Cycle Count
Example 5-31  If-then-else assembly code                                  (2 x 32) + 6   70
Example 5-32  If-then-else assembly code with loop count greater than 3   (2 x 32) + 4   68
Example 5-36  Unrolled if-then-else assembly code                         (3 x 16) + 5   53
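The cycle counts in the table all follow the same pattern: iteration interval times the number of loop iterations, plus a fixed overhead. A minimal C sketch of that arithmetic follows; the function names are hypothetical and the helpers are only a way to check the figures, not a performance model.

```c
/* Hypothetical helper: total cycles = iteration interval * iterations
 * + fixed overhead (setup and prolog cycles outside the kernel). */
int loop_cycles(int ii, int iterations, int overhead)
{
    return ii * iterations + overhead;
}

/* Percentage of cycles saved going from `before` to `after`,
 * rounded down to a whole percent. */
int percent_saved(int before, int after)
{
    return (before - after) * 100 / before;
}
```

For example, loop_cycles(2, 32, 6) gives 70 and loop_cycles(3, 16, 5) gives 53, so the unrolled loop saves about 24% of the total cycles; the 25% figure quoted for throughput compares kernel cycles only (three cycles instead of four for every two iterations).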
5.9 Live-Too-Long Issues


When the result of a parent instruction is live longer than the minimum iteration
interval of a loop, you have a live-too-long problem. Because each instruction
executes every iteration interval cycle, the next iteration of that parent
overwrites the register with a new value before the child can read it.
Section 5.5.6.1, Resource Conflicts, on page 5-38 showed how to solve this
problem simply by moving the parent to a later cycle. This is not always a valid
solution.


5.9.1 C Code With Live-Too-Long Problem 


Example 5-37 shows C code with a live-too-long problem that cannot be
solved by rescheduling the parent instruction. Although it is not obvious from
the C code, the dependency graph in Figure 5-14 on page 5-76 shows a
split-join path that causes this live-too-long problem.

Example 5-37. Live-Too-Long C Code

int live_long(short a[], short b[], short c, short d, short e)
{
    int i, sum0, sum1, sum, a0, a2, a3, b0, b2, b3;
    short a1, b1;

    sum0 = 0;
    sum1 = 0;
    for (i = 0; i < 100; i++) {
        a0 = a[i] * c;
        a1 = a0 >> 15;
        a2 = a1 * d;
        a3 = a2 + a0;
        sum0 += a3;
        b0 = b[i] * c;
        b1 = b0 >> 15;
        b2 = b1 * e;
        b3 = b2 + b0;
        sum1 += b3;
    }
    sum = sum0 + sum1;
    return (sum);
}

5.9.2 Translating C Code to Linear Assembly


Example 5-38 shows the assembly instructions that execute the inner loop in 
Example 5-37. 


Example 5-38. Linear Assembly for Live-Too-Long Inner Loop

          LDH     *aptr++,ai          ; load ai from memory
          LDH     *bptr++,bi          ; load bi from memory
          MPY     ai,c,a0             ; a0 = ai * c
          SHR     a0,15,a1            ; a1 = a0 >> 15
          MPY     a1,d,a2             ; a2 = a1 * d
          ADD     a2,a0,a3            ; a3 = a2 + a0
          ADD     sum0,a3,sum0        ; sum0 += a3
          MPY     bi,c,b0             ; b0 = bi * c
          SHR     b0,15,b1            ; b1 = b0 >> 15
          MPY     b1,e,b2             ; b2 = b1 * e
          ADD     b2,b0,b3            ; b3 = b2 + b0
          ADD     sum1,b3,sum1        ; sum1 += b3
   [cntr] SUB     cntr,1,cntr         ; decrement loop counter
   [cntr] B       LOOP                ; branch to loop


5.9.3 Drawing a Dependency Graph 


Figure 5-14 shows the dependency graph for the live-too-long code. This
algorithm includes three separate and independent graphs. Two of the
independent graphs have split-join paths: from a0 to a3 and from b0 to b3.

Figure 5-14. Dependency Graph of Live-Too-Long Code

[Figure not reproduced: the graphs beginning with LDH each contain a path labeled "Split-join path"; the graph also includes the SUB for the loop counter.]

5.9.4 Determining the Minimum Iteration Interval


Table 5-16 shows the functional unit resources for the loop. Based on the
resource usage, the minimum iteration interval is 2 for the following reasons:

- No specific resource is used more than twice, implying a minimum
  iteration interval of 2.

- A total of five non-.M units on each side also implies a minimum iteration
  interval of 2, because three non-.M units can be used on a side during each
  cycle.


Table 5-16. Resource Table for Live-Too-Long Code

(a) A side

Unit(s)              Instructions      Total/Unit
.M1                  2 MPYs            2
.S1                  B and SHR         2
.D1                  LDH               1
.L1, .S1, or .D1     2 ADDs            2
Total non-.M units                     5

(b) B side

Unit(s)              Instructions      Total/Unit
.M2                  2 MPYs            2
.S2                  SHR               1
.D2                  LDH               1
.L2, .S2, or .D2     2 ADDs and SUB    3
Total non-.M units                     5


However, the minimum iteration interval is determined by both resources and 
data dependency. A loop carry path determined the minimum iteration interval 
of the IIR filter in section 5.6, Loop Carry Paths, on page 5-50. In this example, 
a live-too-long problem determines the minimum iteration interval. 


5.9.4.1 Split-Join-Path Problems 


In Figure 5-14, the two split-join paths from a0 to a3 and from b0 to b3 create
the live-too-long problem. Because the ADD a3 instruction cannot be
scheduled until the SHR a1 and MPY a2 instructions finish, a0 must be live for at
least four cycles. For example:

- If MPY a0 is scheduled on cycle 5, then the earliest SHR a1 can be
  scheduled is cycle 7.

- The earliest MPY a2 can be scheduled is cycle 8.

- The earliest ADD a3 can be scheduled is cycle 10.

Because a0 is written at the end of cycle 6, it must be live from cycle 7 to
cycle 10, or four cycles. No value can be live longer than the minimum iteration
interval, because the next iteration of the loop will overwrite that value before
the current iteration can read the value. Therefore, if the value has to be live
for four cycles, the minimum iteration interval must be at least 4. A minimum
iteration interval of 4 means that the loop executes at half the performance that
it could based on available resources.
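The cycle numbers in this example can be reproduced with a few additions. The C sketch below is only an illustration of the reasoning: the function names are hypothetical, and the latencies are the ones implied by the example (the MPY result is readable two cycles after issue, the SHR result one cycle after issue).

```c
/* Result latencies in cycles, as implied by the example above. */
enum { LAT_MPY = 2, LAT_SHR = 1 };

/* Given the cycle on which MPY a0 issues, return how many cycles a0
 * must stay live until ADD a3 reads it. The cycle numbers in the
 * comments are for mpy_a0_cycle = 5. */
int a0_lifetime(int mpy_a0_cycle)
{
    int first_live = mpy_a0_cycle + LAT_MPY;  /* a0 readable: cycle 7   */
    int shr_a1     = first_live;              /* SHR a1 issues: cycle 7 */
    int mpy_a2     = shr_a1 + LAT_SHR;        /* MPY a2 issues: cycle 8 */
    int add_a3     = mpy_a2 + LAT_MPY;        /* ADD a3 issues: cycle 10 */
    return add_a3 - first_live + 1;           /* cycles 7..10 -> 4      */
}

/* A value that must stay live for `lifetime` cycles forces the
 * iteration interval to be at least that large. */
int min_ii_with_lifetime(int resource_ii, int lifetime)
{
    return lifetime > resource_ii ? lifetime : resource_ii;
}
```

With MPY a0 on cycle 5, a0_lifetime(5) is 4, and min_ii_with_lifetime(2, 4) is 4, which matches the conclusion that the split-join path, not the resources, limits this loop.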


5.9.4.2 Unrolling the Loop 


One way to solve this problem is to unroll the loop, so that you are doing twice 
as much work in each iteration. After unrolling, the minimum iteration interval 
is 4, based on both the resources and the data dependencies of the split-join 
path. Although unrolling the loop allows you to achieve the highest possible 
loop throughput, unrolling the loop does increase the code size. 


5.9.4.3 Inserting Moves 


Another solution to the live-too-long problem is to break up the lifetime of a0 
and b0 by inserting move (MV) instructions. The MV instruction breaks up the 
left path of the split-join path into two smaller pieces. 


5.9.4.4 Drawing a New Dependency Graph 


Figure 5-15 shows the new dependency graph with the MV instructions. Now 
the left paths of the split-join paths are broken into two pieces. Each value, a0 
and a0’, can be live for minimum iteration interval cycles. If MPY a0 is sched- 
uled on cycle 5 and ADD a3 is scheduled on cycle 10, you can achieve a mini- 
mum iteration interval of 2 by scheduling MV a0’ on cycle 8. Then a0 is live on 
cycles 7 and 8, and a0’ is live on cycles 9 and 10. Because no values are live 
more than two cycles, the minimum iteration interval for this graph is 2. 
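The same idea generalizes: if a value must survive longer than the iteration interval, each inserted MV starts a new lifetime in a new register. The helper below is a hypothetical sketch of that count, assuming each copy can be scheduled exactly where the previous lifetime ends; it ignores the functional unit that every extra MV consumes.

```c
/* Hypothetical helper: number of MV copies needed so that no single
 * register holds the value for more than `ii` cycles.
 *   lifetime : cycles the value must survive in total
 *   ii       : iteration interval of the loop (must be > 0) */
int moves_needed(int lifetime, int ii)
{
    int pieces = (lifetime + ii - 1) / ii;   /* ceil(lifetime / ii) */
    return pieces > 0 ? pieces - 1 : 0;
}
```

For the example in the text, moves_needed(4, 2) is 1: a single MV splits the four-cycle lifetime of a0 into two two-cycle lifetimes (a0 and a0').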
Figure 5-15. Dependency Graph of Live-Too-Long Code (Split-Join Path Resolved)

[Figure not reproduced: A side and B side graphs, each beginning with LDH.]


5.9.5 Allocating Resources 


Example 5-39 shows the linear assembly code with the functional units as- 
signed. The choice of units for the ADDs and SUB is flexible and represents 
one of a number of possibilities. One goal is to ensure that no functional unit 
is used more than the minimum iteration interval, or two times. 


The two 2X paths and one 1X path are required because the values c, d, and
e reside on the side opposite from the instruction that is reading them. If these
values had created a bottleneck of resources and caused the minimum
iteration interval to increase, c, d, and e could have been loaded into the
opposite register file outside the loop to eliminate the cross path.

Example 5-39. Linear Assembly for Full Live-Too-Long Code

          .global _live_long
_live_long: .cproc a, b, c, d, e
          .reg    ai, bi, sum0, sum1, sum
          .reg    a0p, a_0, a_1, a_2, a_3, b_0, b0p, b_1, b_2, b_3, cntr

          MVK     100,cntr                  ; cntr = 100
          ZERO    sum0                      ; sum0 = 0
          ZERO    sum1                      ; sum1 = 0

LOOP:     .trip 100
          LDH     .D1     *a++,ai           ; load ai from memory
          LDH     .D2     *b++,bi           ; load bi from memory
          MPY     .M1     ai,c,a_0          ; a0 = ai * c
          SHR     .S1     a_0,15,a_1        ; a1 = a0 >> 15
          MPY     .M1X    a_1,d,a_2         ; a2 = a1 * d
          MV      .D1     a_0,a0p           ; save a0 across iterations
          ADD     .L1     a_2,a0p,a_3       ; a3 = a2 + a0
          ADD     .L1     sum0,a_3,sum0     ; sum0 += a3
          MPY     .M2X    bi,c,b_0          ; b0 = bi * c
          SHR     .S2     b_0,15,b_1        ; b1 = b0 >> 15
          MPY     .M2X    b_1,e,b_2         ; b2 = b1 * e
          MV      .D2     b_0,b0p           ; save b0 across iterations
          ADD     .L2     b_2,b0p,b_3       ; b3 = b2 + b0
          ADD     .L2     sum1,b_3,sum1     ; sum1 += b3
   [cntr] SUB     .S2     cntr,1,cntr       ; decrement loop counter
   [cntr] B       .S1     LOOP              ; branch to loop

          ADD     sum0,sum1,sum             ; Add sum0 and sum1 for ret value

          .return sum
          .endproc

5.9.6 Final Assembly With Move Instructions


Example 5-40 shows the final assembly code after software pipelining. The
performance of this loop is 212 cycles (2 x 100 + 11 + 1).

Example 5-40. Assembly Code for Live-Too-Long With Move Instructions

          LDH     .D1     *A4++,A0        ; load ai from memory
||        LDH     .D2     *B4++,B0        ; load bi from memory

          MVK     .S2     100,B2          ; set up loop counter

          LDH     .D1     *A4++,A0        ;* load ai from memory
||        LDH     .D2     *B4++,B0        ;* load bi from memory

          ZERO    .S1     A1              ; zero out accumulator
||        ZERO    .S2     B1              ; zero out accumulator

          LDH     .D1     *A4++,A0        ;** load ai from memory
||        LDH     .D2     *B4++,B0        ;** load bi from memory

   [B2]   SUB     .S2     B2,1,B2         ; decrement loop counter

          MPY     .M1     A0,A6,A3        ; a0 = ai * c
||        MPY     .M2X    B0,A6,B10       ; b0 = bi * c
||        LDH     .D1     *A4++,A0        ;*** load ai from memory
||        LDH     .D2     *B4++,B0        ;*** load bi from memory

   [B2]   SUB     .S2     B2,1,B2         ; decrement loop counter
|| [B2]   B       .S1     LOOP            ; branch to loop

          SHR     .S1     A3,15,A5        ; a1 = a0 >> 15
||        SHR     .S2     B10,15,B5       ; b1 = b0 >> 15
||        MPY     .M1     A0,A6,A3        ;* a0 = ai * c
||        MPY     .M2X    B0,A6,B10       ;* b0 = bi * c
||        LDH     .D1     *A4++,A0        ;**** load ai from memory
||        LDH     .D2     *B4++,B0        ;**** load bi from memory

          MPY     .M1X    A5,B6,A7        ; a2 = a1 * d
||        MV      .D1     A3,A2           ; save a0 across iterations
||        MPY     .M2X    B5,A8,B7        ; b2 = b1 * e
||        MV      .D2     B10,B8          ; save b0 across iterations
|| [B2]   SUB     .S2     B2,1,B2         ;* decrement loop counter
|| [B2]   B       .S1     LOOP            ;* branch to loop

          SHR     .S1     A3,15,A5        ;* a1 = a0 >> 15
||        SHR     .S2     B10,15,B5       ;* b1 = b0 >> 15
||        MPY     .M1     A0,A6,A3        ;** a0 = ai * c
||        MPY     .M2X    B0,A6,B10       ;** b0 = bi * c
||        LDH     .D1     *A4++,A0        ;***** load ai from memory
||        LDH     .D2     *B4++,B0        ;***** load bi from memory

LOOP:
          ADD     .L1     A7,A2,A9        ;* a3 = a2 + a0
||        ADD     .L2     B7,B8,B9        ;* b3 = b2 + b0
||        MPY     .M1X    A5,B6,A7        ;* a2 = a1 * d
||        MV      .D1     A3,A2           ;* save a0 across iterations
||        MPY     .M2X    B5,A8,B7        ;* b2 = b1 * e
||        MV      .D2     B10,B8          ;* save b0 across iterations
|| [B2]   SUB     .S2     B2,1,B2         ;** decrement loop counter
|| [B2]   B       .S1     LOOP            ;** branch to loop

          ADD     .L1     A1,A9,A1        ; sum0 += a3
||        ADD     .L2     B1,B9,B1        ; sum1 += b3
||        SHR     .S1     A3,15,A5        ;** a1 = a0 >> 15
||        SHR     .S2     B10,15,B5       ;** b1 = b0 >> 15
||        MPY     .M1     A0,A6,A3        ;*** a0 = ai * c
||        MPY     .M2X    B0,A6,B10       ;*** b0 = bi * c
||        LDH     .D1     *A4++,A0        ;****** load ai from memory
||        LDH     .D2     *B4++,B0        ;****** load bi from memory
          ; Branch occurs here

          ADD     .L1X    A1,B1,A4        ; sum = sum0 + sum1

5.10 Redundant Load Elimination


Filter algorithms typically read the same value from memory multiple times and 
are, therefore, prime candidates for optimization by eliminating redundant load 
instructions. Rather than perform a load operation each time a particular value 
is read, you can keep the value in a register and read the register multiple 
times. 


5.10.1 FIR Filter C Code 


Example 5-41 shows C code for a simple FIR filter. There are two memory 
reads (x[i+j] and h[i]) for each multiply. Because the 'C62xx can perform only 
two LDHs per cycle, it seems, at first glance, that only one multiply-accumulate 
per cycle is possible. 


Example 5-41. FIR Filter C Code

void fir(short x[], short h[], short y[])
{
    int i, j, sum;

    for (j = 0; j < 100; j++) {
        sum = 0;
        for (i = 0; i < 32; i++)
            sum += x[i+j] * h[i];
        y[j] = sum >> 15;
    }
}

One way to optimize this situation is to perform LDWs instead of LDHs to read 
two data values at a time. Although using LDW works for the h array, the x array 
presents a different problem because the ’C62xx does not allow you to load 
values across a word boundary. 


For example, on the first outer loop (j = 0), you can read the x-array elements 
(0 and 1, 2 and 3, etc.) as long as elements 0 and 1 are aligned on a 4-byte 
word boundary. However, the second outer loop (j = 1) requires reading x-array 
elements 1 through 32. The LDW operation must load elements that are not 
word-aligned (1 and 2, 3 and 4, etc.). 
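The alignment problem can be made concrete by computing the byte address of each element pair. The following C sketch is an illustration only; the function name is hypothetical, and it assumes 2-byte short elements and that base_addr is the byte address of x[0].

```c
#include <stdint.h>

/* Returns 1 if a 32-bit load (LDW) starting at element x[index] of a
 * short array would fall on a 4-byte word boundary, 0 otherwise.
 * Assumes 2-byte shorts; base_addr is the byte address of x[0]. */
int ldw_is_aligned(uint32_t base_addr, int index)
{
    uint32_t addr = base_addr + 2u * (uint32_t)index;
    return (addr & 3u) == 0;
}
```

If x[0] is word-aligned, ldw_is_aligned(base, 0) and ldw_is_aligned(base, 2) return 1, but ldw_is_aligned(base, 1) returns 0, which is the case the second outer loop (j = 1) runs into.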


5.10.1.1 Redundant Loads 


In order to achieve two multiply-accumulates per cycle, you must reduce the 
number of LDHs. Because successive outer loops read all the same h-array 
values and almost all of the same x-array values, you can eliminate the redun- 
dant loads by unrolling the inner and outer loops. 


For example, x[1] is needed for the first outer loop (x[j+1] with j = 0) and for the
second outer loop (x[j] with j = 1). You can use a single LDH instruction to load
this value.

5.10.1.2 New FIR Filter C Code

Example 5-42 shows that after eliminating redundant loads, there are four
memory-read operations for every four multiply-accumulate operations. Now
the memory accesses no longer limit the performance.


Example 5-42. FIR Filter C Code With Redundant Load Elimination

void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, h0, h1;

    for (j = 0; j < 100; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i += 2) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
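A transformation like this is easy to get subtly wrong, so it is worth checking that the rewritten filter still produces the same outputs. The sketch below is a hypothetical test harness, not part of the original examples: the function names, the NOUT and NTAP constants, and the test pattern are assumptions chosen so that the sums stay well within 32 bits.

```c
#define NOUT 100   /* number of outputs, as in the example */
#define NTAP 32    /* number of filter taps, as in the example */

/* Straightforward FIR: two loads per multiply-accumulate. */
void fir_ref(const short x[], const short h[], short y[])
{
    int i, j, sum;
    for (j = 0; j < NOUT; j++) {
        sum = 0;
        for (i = 0; i < NTAP; i++)
            sum += x[i + j] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* FIR with redundant loads removed: inner and outer loops unrolled by
 * two, four loads for every four multiply-accumulates. */
void fir_rle(const short x[], const short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, h0, h1;
    for (j = 0; j < NOUT; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < NTAP; i += 2) {
            x1 = x[j + i + 1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j + i + 2];
            h1 = h[i + 1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }
        y[j] = (short)(sum0 >> 15);
        y[j + 1] = (short)(sum1 >> 15);
    }
}

/* Runs both versions on an arbitrary test pattern and returns the
 * number of outputs that differ. */
int fir_mismatches(void)
{
    short x[NOUT + NTAP - 1], h[NTAP], y_ref[NOUT], y_rle[NOUT];
    int k, bad = 0;
    for (k = 0; k < NOUT + NTAP - 1; k++)
        x[k] = (short)((k * 7919) % 6001 - 3000);
    for (k = 0; k < NTAP; k++)
        h[k] = (short)((k * 104729) % 6001 - 3000);
    fir_ref(x, h, y_ref);
    fir_rle(x, h, y_rle);
    for (k = 0; k < NOUT; k++)
        if (y_ref[k] != y_rle[k])
            bad++;
    return bad;
}
```

Both versions read the same range of x, up to x[130], so an input array of 131 elements is sufficient for either one. fir_mismatches() is expected to return 0.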
5.10.2 Translating C Code to Linear Assembly


Example 5-43 shows the linear assembly that performs the inner loop of the
FIR filter C code.

Element x0 is read by the MPY p00 before it is loaded by the LDH x0
instruction; x[j] (the first x0) is loaded outside the loop, but successive even
elements are loaded inside the loop.

Example 5-43. Linear Assembly for FIR Inner Loop

          LDH     .D2     *x_1++[2],x1      ; x1 = x[j+i+1]
          LDH     .D1     *h++[2],h0        ; h0 = h[i]
          MPY     .M1     x0,h0,p00         ; x0 * h0
          MPY     .M1X    x1,h0,p10         ; x1 * h0
          ADD     .L1     p00,sum0,sum0     ; sum0 += x0 * h0
          ADD     .L2X    p10,sum1,sum1     ; sum1 += x1 * h0

          LDH     .D1     *x++[2],x0        ; x0 = x[j+i+2]
          LDH     .D2     *h_1++[2],h1      ; h1 = h[i+1]
          MPY     .M2     x1,h1,p01         ; x1 * h1
          MPY     .M2X    x0,h1,p11         ; x0 * h1
          ADD     .L1X    p01,sum0,sum0     ; sum0 += x1 * h1
          ADD     .L2     p11,sum1,sum1     ; sum1 += x0 * h1

   [ctr]  SUB     .S2     ctr,1,ctr         ; decrement loop counter
   [ctr]  B       .S2     LOOP              ; branch to loop

5.10.3 Drawing a Dependency Graph


Figure 5-16 shows the dependency graph of the FIR filter with redundant load
elimination.

Figure 5-16. Dependency Graph of FIR Filter (With Redundant Load Elimination)

[Figure not reproduced: the graph is split into an A side and a B side.]

5.10.4 Determining the Minimum Iteration Interval


Table 5-17 shows that the minimum iteration interval is 2. An iteration interval 
of 2 means that two multiply-accumulates are executing per cycle. 


Table 5-17. Resource Table for FIR Filter Code

(a) A side

Unit(s)              Instructions      Total/Unit
.M1                  2 MPYs            2
.S1                                    0
.D1                  2 LDHs            2
.L1, .S1, or .D1     2 ADDs            2
Total non-.M units                     4
1X paths                               2

(b) B side

Unit(s)              Instructions      Total/Unit
.M2                  2 MPYs            2
.S2                  B                 1
.D2                  2 LDHs            2
.L2, .S2, or .D2     2 ADDs and SUB    3
Total non-.M units                     6
2X paths                               2


5.10.5 Allocating Resources 


Example 5-44 shows the linear assembly with functional units and registers 
assigned. 


Example 5-44. Linear Assembly for Full FIR Code

          .global _fir
_fir:     .cproc  x, h, y
          .reg    x_1, h_1, sum0, sum1, ctr, octr
          .reg    p00, p01, p10, p11, x0, x1, h0, h1, rstx, rsth

          ADD     h,2,h_1                   ; set up pointer to h[1]
          MVK     50,octr                   ; outer loop ctr = 100/2
          MVK     64,rstx                   ; used to rst x pointer each outer loop
          MVK     64,rsth                   ; used to rst h pointer each outer loop

OUTLOOP:
          ADD     x,2,x_1                   ; set up pointer to x[j+1]
          SUB     h_1,2,h                   ; set up pointer to h[0]
          MVK     16,ctr                    ; inner loop ctr = 32/2
          ZERO    sum0                      ; sum0 = 0
          ZERO    sum1                      ; sum1 = 0
   [octr] SUB     octr,1,octr               ; decrement outer loop counter

          LDH     .D1     *x++[2],x0        ; x0 = x[j]

LOOP:     .trip 16
          LDH     .D2     *x_1++[2],x1      ; x1 = x[j+i+1]
          LDH     .D1     *h++[2],h0        ; h0 = h[i]
          MPY     .M1     x0,h0,p00         ; x0 * h0
          MPY     .M1X    x1,h0,p10         ; x1 * h0
          ADD     .L1     p00,sum0,sum0     ; sum0 += x0 * h0
          ADD     .L2X    p10,sum1,sum1     ; sum1 += x1 * h0

          LDH     .D1     *x++[2],x0        ; x0 = x[j+i+2]
          LDH     .D2     *h_1++[2],h1      ; h1 = h[i+1]
          MPY     .M2     x1,h1,p01         ; x1 * h1
          MPY     .M2X    x0,h1,p11         ; x0 * h1
          ADD     .L1X    p01,sum0,sum0     ; sum0 += x1 * h1
          ADD     .L2     p11,sum1,sum1     ; sum1 += x0 * h1

   [ctr]  SUB     .S2     ctr,1,ctr         ; decrement loop counter
   [ctr]  B       .S2     LOOP              ; branch to loop

          SHR     sum0,15,sum0              ; sum0 >> 15
          SHR     sum1,15,sum1              ; sum1 >> 15
          STH     sum0,*y++                 ; y[j] = sum0 >> 15
          STH     sum1,*y++                 ; y[j+1] = sum1 >> 15
          SUB     x,rstx,x                  ; reset x pointer to x[j]
          SUB     h_1,rsth,h_1              ; reset h pointer to h[0]
   [octr] B       OUTLOOP                   ; branch to outer loop

          .endproc


5.10.6 Final Assembly

Example 5-45 shows the final assembly for the FIR filter without redundant
load instructions. At the end of the inner loop is a branch to OUTLOOP that
executes the next outer loop. The outer loop counter is 50 because iterations
j and j + 1 execute each time the inner loop is run. The inner loop counter is
16 because iterations i and i + 1 execute each inner loop iteration.

The cycle count for this nested loop is 2352 cycles: 50 (16 x 2 + 9 + 6) + 2.

Fifteen cycles are overhead for each outer loop:

- Nine cycles execute the inner loop prolog.
- Six cycles execute the branch to the outer loop.

See section 5.12, Software Pipelining the Outer Loop, on page 5-104 for
information on how to reduce this overhead.
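The nested-loop cycle count can be checked with a small calculation. The helpers below are a hypothetical sketch of that arithmetic (the names are not from the TI tools); the parameters map directly onto the terms of the formula quoted in the text.

```c
/* Hypothetical helper for the nested-loop cycle count:
 *   outer_iters  : outer loop iterations (50)
 *   inner_iters  : inner loop iterations per outer loop (16)
 *   ii           : inner loop iteration interval (2)
 *   prolog       : inner loop prolog cycles per outer loop (9)
 *   outer_branch : cycles for the branch to the outer loop (6)
 *   setup        : one-time setup cycles before the outer loop (2) */
int nested_loop_cycles(int outer_iters, int inner_iters, int ii,
                       int prolog, int outer_branch, int setup)
{
    int per_outer = inner_iters * ii + prolog + outer_branch;
    return outer_iters * per_outer + setup;
}

/* Share of the total spent on per-outer-loop overhead, in percent,
 * rounded down. */
int overhead_percent(int outer_iters, int prolog, int outer_branch,
                     int total_cycles)
{
    return outer_iters * (prolog + outer_branch) * 100 / total_cycles;
}
```

nested_loop_cycles(50, 16, 2, 9, 6, 2) gives 2352, and overhead_percent(50, 9, 6, 2352) gives 31, so roughly a third of the cycles are spent outside the inner loop kernel.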



Example 5-45. Final Assembly Code for FIR Filter With Redundant Load Elimination

          MVK     .S1     50,A2           ; set up outer loop counter

          MVK     .S1     80,A3           ; used to rst x ptr outer loop
||        MVK     .S2     82,B6           ; used to rst h ptr outer loop

OUTLOOP:
          LDH     .D1     *A4++[2],A0     ; x0 = x[j]
||        ADD     .L2X    A4,2,B5         ; set up pointer to x[j+1]
||        ADD     .D2     B4,2,B4         ; set up pointer to h[1]
||        ADD     .L1X    B4,0,A5         ; set up pointer to h[0]
||        MVK     .S2     16,B2           ; set up inner loop counter
|| [A2]   SUB     .S1     A2,1,A2         ; decrement outer loop counter

          LDH     .D1     *A5++[2],A1     ; h0 = h[i]
||        LDH     .D2     *B5++[2],B1     ; x1 = x[j+i+1]
||        ZERO    .L1     A9              ; zero out sum0
||        ZERO    .L2     B9              ; zero out sum1

          LDH     .D2     *B4++[2],B0     ; h1 = h[i+1]
||        LDH     .D1     *A4++[2],A0     ; x0 = x[j+i+2]

          LDH     .D1     *A5++[2],A1     ;* h0 = h[i]
||        LDH     .D2     *B5++[2],B1     ;* x1 = x[j+i+1]

   [B2]   SUB     .S2     B2,1,B2         ; decrement inner loop counter
||        LDH     .D2     *B4++[2],B0     ;* h1 = h[i+1]
||        LDH     .D1     *A4++[2],A0     ;* x0 = x[j+i+2]

   [B2]   B       .S2     LOOP            ; branch to inner loop
||        LDH     .D1     *A5++[2],A1     ;** h0 = h[i]
||        LDH     .D2     *B5++[2],B1     ;** x1 = x[j+i+1]

          MPY     .M1     A0,A1,A7        ; x0 * h0
|| [B2]   SUB     .S2     B2,1,B2         ;* decrement inner loop counter
||        LDH     .D2     *B4++[2],B0     ;** h1 = h[i+1]
||        LDH     .D1     *A4++[2],A0     ;** x0 = x[j+i+2]

          MPY     .M2     B1,B0,B7        ; x1 * h1
||        MPY     .M1X    B1,A1,A8        ; x1 * h0
|| [B2]   B       .S2     LOOP            ;* branch to inner loop
||        LDH     .D1     *A5++[2],A1     ;*** h0 = h[i]
||        LDH     .D2     *B5++[2],B1     ;*** x1 = x[j+i+1]

          MPY     .M2X    A0,B0,B8        ; x0 * h1
||        MPY     .M1     A0,A1,A7        ;* x0 * h0
|| [B2]   SUB     .S2     B2,1,B2         ;** decrement inner loop counter
||        LDH     .D2     *B4++[2],B0     ;*** h1 = h[i+1]
||        LDH     .D1     *A4++[2],A0     ;*** x0 = x[j+i+2]

LOOP:
          ADD     .L2X    A8,B9,B9        ; sum1 += x1 * h0
||        ADD     .L1     A7,A9,A9        ; sum0 += x0 * h0
||        MPY     .M2     B1,B0,B7        ;* x1 * h1
||        MPY     .M1X    B1,A1,A8        ;* x1 * h0
|| [B2]   B       .S2     LOOP            ;** branch to inner loop
||        LDH     .D1     *A5++[2],A1     ;**** h0 = h[i]
||        LDH     .D2     *B5++[2],B1     ;**** x1 = x[j+i+1]

          ADD     .L1X    B7,A9,A9        ; sum0 += x1 * h1
||        ADD     .L2     B8,B9,B9        ; sum1 += x0 * h1
||        MPY     .M2X    A0,B0,B8        ;* x0 * h1
||        MPY     .M1     A0,A1,A7        ;** x0 * h0
|| [B2]   SUB     .S2     B2,1,B2         ;*** decrement inner loop cntr
||        LDH     .D2     *B4++[2],B0     ;**** h1 = h[i+1]
||        LDH     .D1     *A4++[2],A0     ;**** x0 = x[j+i+2]
          ; inner loop branch occurs here

   [A2]   B       .S1     OUTLOOP         ; branch to outer loop
||        SUB     .L1     A4,A3,A4        ; reset x pointer to x[j]
||        SUB     .L2     B4,B6,B4        ; reset h pointer to h[0]

          SHR     .S1     A9,15,A9        ; sum0 >> 15
||        SHR     .S2     B9,15,B9        ; sum1 >> 15

          STH     .D1     A9,*A6++        ; y[j] = sum0 >> 15

          STH     .D1     B9,*A6++        ; y[j+1] = sum1 >> 15

          NOP     2                       ; branch delay slots
          ; outer loop branch occurs here




5.11 Memory Banks 


The internal memory of the ’C62xx family varies from device to device. See the 
TMS320C62xx Peripherals Reference Guide to determine the memory 
spaces in your particular device. This section discusses how to write code to 
avoid memory bank conflicts. 


Most ’C62xx devices use an interleaved memory bank scheme, as shown in 
Figure 5-17. Each number in the boxes represents a byte address. A load byte 
(LDB) instruction from address 0 loads byte 0 in bank 0. A load halfword (LDH) 
from address 0 loads the halfword value in bytes 0 and 1, which are also in 
bank 0. An LDW from address 0 loads bytes 0 through 3 in banks 0 and 1. 


Because each bank is single-ported memory, only one access to each bank
is allowed per cycle. Two accesses to a single bank in a given cycle result in
a memory stall that halts all pipeline operation for one cycle while the second
value is read from memory. Two memory operations per cycle are allowed
without any stall, as long as they do not access the same bank.
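
The bank rules above can be modeled with a few lines of C. This is only a sketch: it assumes four banks that are each one halfword (2 bytes) wide, which is what the LDB, LDH, and LDW cases described above imply, and it assumes both accesses fall in the same memory space. The names bank_of, banks_touched, and bank_conflict are illustrative, not part of any TI tool.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 4 interleaved banks, each 2 bytes wide. */
enum { NUM_BANKS = 4, BANK_WIDTH_BYTES = 2 };

/* Bank that holds the byte at byte address addr. */
static unsigned bank_of(uint32_t addr)
{
    return (addr / BANK_WIDTH_BYTES) % NUM_BANKS;
}

/* Bit mask of the banks touched by an access of size bytes (1, 2, or 4). */
static unsigned banks_touched(uint32_t addr, unsigned size)
{
    unsigned mask = 0;
    for (unsigned b = 0; b < size; b++)
        mask |= 1u << bank_of(addr + b);
    return mask;
}

/* True if two accesses issued on the same cycle share a bank (one stall). */
static bool bank_conflict(uint32_t addr_a, unsigned size_a,
                          uint32_t addr_b, unsigned size_b)
{
    return (banks_touched(addr_a, size_a) & banks_touched(addr_b, size_b)) != 0;
}
```

Under these assumptions, an LDH from address 0 and an LDH from address 8 conflict (both are in bank 0), while LDHs from addresses 0 and 2 do not.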


Figure 5-17. 4-Bank Interleaved Memory

[Figure: byte addresses interleaved across Bank 0, Bank 1, Bank 2, and Bank 3]


For devices that have more than one memory space (see Figure 5-18), an 
access to bank 0 in one space does not interfere with an access to bank 0 in 
another memory space, and no pipeline stall occurs. 
Figure 5-18. 4-Bank Interleaved Memory With Two Memory Spaces

[Figure: memory space 0 and memory space 1, each with Bank 0, Bank 1, Bank 2, and Bank 3]


If each array in a loop resides in a separate memory space, the 2-cycle loop 
in Example 5—42 on page 5-84 is sufficient. This section describes a solution 
when two arrays must reside in the same memory space. 


5.11.1 FIR Filter Inner Loop


Example 5-46 shows the inner loop from the final assembly in Example 5—45. 
The LDHs from the h array are in parallel with LDHs from the x array. If x[1] is 
on an even halfword (bank 0) and h[0] is on an odd halfword (bank 1), 
Example 5-46 has no memory conflicts. However, if both x[1] and h[0] are on 
an even halfword in memory (bank 0) and they are in the same memory space, 
every cycle incurs a memory pipeline stall and the loop runs at half the speed. 


Example 5-46. Final Assembly Code for Inner Loop of FIR Filter

LOOP:
        ADD     .L2X    A8,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1     A7,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2     B1,B0,B7        ;* x1 * h1
||      MPY     .M1X    B1,A1,A8        ;* x1 * h0
|| [B2] B       .S2     LOOP            ;** branch to inner loop
||      LDH     .D1     *A5++[2],A1     ;**** h0 = h[i]
||      LDH     .D2     *B5++[2],B1     ;**** x1 = x[j+i+1]

        ADD     .L1X    B7,A9,A9        ; sum0 += x1 * h1
||      ADD     .L2     B8,B9,B9        ; sum1 += x0 * h1
||      MPY     .M2X    A0,B0,B8        ;* x0 * h1
||      MPY     .M1     A0,A1,A7        ;** x0 * h0
|| [B2] SUB     .S2     B2,1,B2         ;*** decrement inner loop cntr
||      LDH     .D2     *B4++[2],B0     ;**** h1 = h[i+1]
||      LDH     .D1     *A4++[2],A0     ;**** x0 = x[j+i+2]


It is not always possible to fully control how arrays are aligned, especially if one
of the arrays is passed into a function as a pointer and that pointer has different
alignments each time the function is called. One solution to this problem is to
write an FIR filter that avoids memory hits, regardless of the x and h array
alignments.


If accesses to the even and odd elements of an array (h or x) are scheduled
on the same cycle, the accesses are always on adjacent memory banks. Thus,
to write an FIR filter that never has memory hits, even and odd elements of the
same array must be scheduled on the same loop cycle.
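
The difference between the two pairings can be checked with a small C model. It is a sketch under the same assumption as the figures above (four banks, each one halfword wide, both arrays in one memory space); the function names and the sample base addresses are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: four banks, each one halfword (2 bytes) wide. */
static unsigned halfword_bank(uint32_t byte_addr)
{
    return (byte_addr >> 1) & 3u;
}

/* Stalls when a[k] and b[k] are loaded on the same cycle, k = 0..n-1.
 * This models pairing a load from x with a load from h. */
static unsigned stalls_cross_array(uint32_t a_base, uint32_t b_base, unsigned n)
{
    unsigned stalls = 0;
    for (unsigned k = 0; k < n; k++)
        if (halfword_bank(a_base + 2u * k) == halfword_bank(b_base + 2u * k))
            stalls++;
    return stalls;
}

/* Stalls when a[k] and a[k+1] are loaded on the same cycle, k = 0, 2, 4, ...
 * This models pairing the even and odd elements of one array. */
static unsigned stalls_same_array(uint32_t a_base, unsigned n)
{
    unsigned stalls = 0;
    for (unsigned k = 0; k + 1 < n; k += 2)
        if (halfword_bank(a_base + 2u * k) == halfword_bank(a_base + 2u * (k + 1)))
            stalls++;
    return stalls;
}
```

With both arrays starting in bank 0, the cross-array pairing stalls on every one of its load pairs, while the same-array pairing never stalls, whatever the base address.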
In the case of the FIR filter, scheduling the even and odd elements of the same
array on the same loop cycle cannot be done in a 2-cycle loop, as shown in
Figure 5-19. In this example, a valid 2-cycle software-pipelined loop without
memory constraints is ruled by the following constraints:

- LDH h0 and LDH h1 are on the same loop cycle.
- LDH x0 and LDH x1 are on the same loop cycle.
- MPY p00 must be scheduled three or four cycles after LDH x0, because
  it must read x0 from the previous iteration of LDH x0.
- All MPYs must be five or six cycles after their LDH parents.
- No MPYs on the same side (A or B) can be on the same loop cycle.


Figure 5-19. Dependency Graph of FIR Filter (With Even and Odd Elements of
Each Array on Same Loop Cycle)

[Figure: dependency graph split into A side and B side]

Note: Numbers in bold represent the cycle the instruction is scheduled on.


The scenario in Figure 5-19 almost works. All nodes satisfy the above 
constraints except MPY p10. Because one parent is on cycle 1 (LDH h0) and 
another on cycle 0 (LDH x1), the only cycle for MPY p10 is cycle 6. However, 
another MPY on the A side is also scheduled on cycle 6 (MPY p00). Other 
combinations of cycles for this graph produce similar results. 
5.11.2 Unrolled FIR Filter C Code


The main limitation in solving the problem in Figure 5—19 is in scheduling a 2- 
cycle loop, which means that no value can be live more than two cycles. In- 
creasing the iteration interval to 3 decreases performance. A better solution 
is to unroll the inner loop one more time and produce a 4-cycle loop. 


Example 5—47 shows the FIR filter C code after unrolling the inner loop one 
more time. This solution adds to the flexibility of scheduling and allows you to 
write FIR filter code that never has memory hits, regardless of array alignment 
and memory space. 


Example 5-47. FIR Filter C Code (Unrolled)

void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,h0,h1,h2,h3;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=4){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
5.11.3 Translating C Code to Linear Assembly


Example 5-48 shows the linear assembly for the unrolled inner loop of the FIR
filter C code.


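Example 5-48. Linear Assembly for Unrolled FIR Inner Loop

        LDH     *x++,x1         ; x1 = x[j+i+1]
        LDH     *h++,h0         ; h0 = h[i]
        MPY     x0,h0,p00       ; x0 * h0
        MPY     x1,h0,p10       ; x1 * h0
        ADD     p00,sum0,sum0   ; sum0 += x0 * h0
        ADD     p10,sum1,sum1   ; sum1 += x1 * h0

        LDH     *x++,x2         ; x2 = x[j+i+2]
        LDH     *h++,h1         ; h1 = h[i+1]
        MPY     x1,h1,p01       ; x1 * h1
        MPY     x2,h1,p11       ; x2 * h1
        ADD     p01,sum0,sum0   ; sum0 += x1 * h1
        ADD     p11,sum1,sum1   ; sum1 += x2 * h1

        LDH     *x++,x3         ; x3 = x[j+i+3]
        LDH     *h++,h2         ; h2 = h[i+2]
        MPY     x2,h2,p02       ; x2 * h2
        MPY     x3,h2,p12       ; x3 * h2
        ADD     p02,sum0,sum0   ; sum0 += x2 * h2
        ADD     p12,sum1,sum1   ; sum1 += x3 * h2

        LDH     *x++,x0         ; x0 = x[j+i+4]
        LDH     *h++,h3         ; h3 = h[i+3]
        MPY     x3,h3,p03       ; x3 * h3
        MPY     x0,h3,p13       ; x0 * h3
        ADD     p03,sum0,sum0   ; sum0 += x3 * h3
        ADD     p13,sum1,sum1   ; sum1 += x0 * h3

[cntr]  SUB     cntr,1,cntr     ; decrement loop counter
[cntr]  B       LOOP            ; branch to loop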
5.11.4 Drawing a Dependency Graph


Figure 5-20 shows the dependency graph of the FIR filter with no memory
hits.


Figure 5-20. Dependency Graph of FIR Filter (With No Memory Hits)

[Figure: dependency graph split into A side and B side]
5.11.5 Linear Assembly for Unrolled FIR Inner Loop With .mptr Directive


Example 5-49 shows the unrolled FIR inner loop with the .mptr directive. The
.mptr directive allows the assembly optimizer to automatically determine if two
memory operations have a bank conflict by associating memory access
information with a specific pointer register.


If the assembly optimizer determines that two memory operations have a bank
conflict, then it will not schedule them in parallel. The .mptr directive tells the
assembly optimizer that when the specified register is used as a memory
pointer in a load or store instruction, it is initialized to point at a base
location + <offset>, and is incremented a number of times each time through
the loop.


Without the .mptr directives, the loads of x1 and h0 are scheduled in parallel,
and the loads of x2 and h1 are scheduled in parallel. This results in a 50%
chance of a memory conflict on every cycle.


Example 5-49. Linear Assembly for Full Unrolled FIR Filter

        .global _fir
_fir:   .cproc  x, h, y

        .reg    x_1, h_1, sum0, sum1, ctr, octr
        .reg    p00, p01, p02, p03, p10, p11, p12, p13
        .reg    x0, x1, x2, x3, h0, h1, h2, h3, rstx, rsth

        ADD     h,2,h_1         ; set up pointer to h[1]
        MVK     50,octr         ; outer loop ctr = 100/2
        MVK     64,rstx         ; used to rst x pointer each outer loop
        MVK     64,rsth         ; used to rst h pointer each outer loop

OUTLOOP:
        ADD     x,2,x_1         ; set up pointer to x[j+1]
        SUB     h_1,2,h         ; set up pointer to h[0]
        MVK     8,ctr           ; inner loop ctr = 32/4
        ZERO    sum0            ; sum0 = 0
        ZERO    sum1            ; sum1 = 0
[octr]  SUB     octr,1,octr     ; decrement outer loop counter

        .mptr   x, x+0
        .mptr   x_1, x+2
        .mptr   h, h+0
        .mptr   h_1, h+2

        LDH     .D2     *x++[2],x0      ; x0 = x[j]
Example 5-49. Linear Assembly for Full Unrolled FIR Filter (Continued)

LOOP:   .trip 8

        LDH     .D1     *x_1++[2],x1    ; x1 = x[j+i+1]
        LDH     .D1     *h++[2],h0      ; h0 = h[i]
        MPY     .M1X    x0,h0,p00       ; x0 * h0
        MPY     .M1     x1,h0,p10       ; x1 * h0
        ADD     .L1     p00,sum0,sum0   ; sum0 += x0 * h0
        ADD     .L2X    p10,sum1,sum1   ; sum1 += x1 * h0

        LDH     .D2     *x++[2],x2      ; x2 = x[j+i+2]
        LDH     .D2     *h_1++[2],h1    ; h1 = h[i+1]
        MPY     .M2X    x1,h1,p01       ; x1 * h1
        MPY     .M2     x2,h1,p11       ; x2 * h1
        ADD     .L1X    p01,sum0,sum0   ; sum0 += x1 * h1
        ADD     .L2     p11,sum1,sum1   ; sum1 += x2 * h1

        LDH     .D1     *x_1++[2],x3    ; x3 = x[j+i+3]
        LDH     .D1     *h++[2],h2      ; h2 = h[i+2]
        MPY     .M1X    x2,h2,p02       ; x2 * h2
        MPY     .M1     x3,h2,p12       ; x3 * h2
        ADD     .L1     p02,sum0,sum0   ; sum0 += x2 * h2
        ADD     .L2X    p12,sum1,sum1   ; sum1 += x3 * h2

        LDH     .D2     *x++[2],x0      ; x0 = x[j+i+4]
        LDH     .D2     *h_1++[2],h3    ; h3 = h[i+3]
        MPY     .M2X    x3,h3,p03       ; x3 * h3
        MPY     .M2     x0,h3,p13       ; x0 * h3
        ADD     .L1X    p03,sum0,sum0   ; sum0 += x3 * h3
        ADD     .L2     p13,sum1,sum1   ; sum1 += x0 * h3

[ctr]   SUB     .S2     ctr,1,ctr       ; decrement loop counter
[ctr]   B       .S2     LOOP            ; branch to loop

        SHR     sum0,15,sum0            ; sum0 >> 15
        SHR     sum1,15,sum1            ; sum1 >> 15
        STH     sum0,*y++               ; y[j] = sum0 >> 15
        STH     sum1,*y++               ; y[j+1] = sum1 >> 15
        SUB     x,rstx,x                ; reset x pointer to x[j]
        SUB     h_1,rsth,h_1            ; reset h pointer to h[0]
[octr]  B       OUTLOOP                 ; branch to outer loop
5.11.6 Allocating Resources


As the number of instructions in a loop increases, assigning a specific register 
to every value in the loop becomes increasingly difficult. If 33 instructions in 
a loop each write a value, they cannot each write to a unique register because 
the ’C62xx has only 32 registers. As a result, values that are not live on the 
same cycles in the loop must share registers. 


For example, in a 4-cycle loop: 


- If a value is written at the end of cycle 0 and read on cycle 2 of the loop,
  it is live for two cycles (cycles 1 and 2 of the loop).

- If another value is written at the end of cycle 2 and read on cycle 0 (the next
  iteration) of the loop, it is also live for two cycles (cycles 3 and 0 of the loop).


Because both of these values are not live on the same cycles, they can occupy 
the same register. Only after scheduling these instructions and their children 
do you know that they can occupy the same register. 
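
The lifetime check in this example can be sketched in C. The helper names live_mask and can_share_register are hypothetical, and the model assumes a value is live from the cycle after it is written through the cycle on which it is last read, with cycles taken modulo the iteration interval. It illustrates the bookkeeping only; it is not how the assembly optimizer is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit mask of loop cycles (0 .. ii-1) on which a value is live.
 * write_cycle: cycle at whose end the value is written.
 * read_cycle:  absolute cycle of the last read; a read on cycle 0 of the
 *              next iteration of a 4-cycle loop is absolute cycle 4. */
static unsigned live_mask(unsigned write_cycle, unsigned read_cycle, unsigned ii)
{
    assert(ii > 0 && ii <= 32 && read_cycle > write_cycle);
    unsigned mask = 0;
    for (unsigned c = write_cycle + 1; c <= read_cycle; c++)
        mask |= 1u << (c % ii);
    return mask;
}

/* Two values can share a register if they are never live on the same cycle. */
static bool can_share_register(unsigned w0, unsigned r0,
                               unsigned w1, unsigned r1, unsigned ii)
{
    return (live_mask(w0, r0, ii) & live_mask(w1, r1, ii)) == 0;
}
```

For the two values described above, can_share_register(0, 2, 2, 4, 4) is true: the first is live on cycles 1 and 2, the second on cycles 3 and 0.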


Register allocation is not complicated but can be tedious when done by hand. 
Each value has to be analyzed for its lifetime and then appropriately combined 
with other values not live on the same cycles in the loop. The assembly opti- 
mizer handles this automatically after it software pipelines the loop. See the 
TMS320C6x Optimizing C Compiler User’s Guide for more information. 


5.11.7 Determining the Minimum Iteration Interval 




Based on Table 5—18, the minimum iteration interval for the FIR filter with no 
memory hits should be 4. An iteration interval of 4 means that two multiply/ac- 
cumulates still execute per cycle. 


Table 5-18. Resource Table for FIR Filter Code

(a) A side

Unit(s)               Instructions       Total/Unit
.M1                   4 MPYs             4
.S1                                      0
.D1                   4 LDHs             4
.L1, .S1, or .D1      4 ADDs             4
Total non-.M units                       8
1X paths                                 4

(b) B side

Unit(s)               Instructions       Total/Unit
.M2                   4 MPYs             4
.S2                   B                  1
.D2                   4 LDHs             4
.L2, .S2, or .D2      4 ADDs and SUB     5
Total non-.M units                       10
2X paths                                 4
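
The bound read off the resource table can be written as a short C calculation. This is a sketch of the resource bound only (it ignores dependency and memory-bank constraints), and the struct and function names are hypothetical. It assumes one .M unit, one cross path, and three non-.M units (.L, .S, .D) per side.

```c
/* Per-side instruction counts taken from a resource table. */
struct side_usage {
    unsigned m;        /* instructions that need the .M unit          */
    unsigned s;        /* instructions that need the .S unit          */
    unsigned d;        /* instructions that need the .D unit          */
    unsigned flexible; /* instructions that may use .L, .S, or .D     */
    unsigned cross;    /* instructions that use the side's cross path */
};

static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }
static unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

/* Smallest iteration interval the resources of one side allow. */
static unsigned side_min_ii(const struct side_usage *u)
{
    unsigned ii = u->m;
    ii = max_u(ii, u->s);
    ii = max_u(ii, u->d);
    ii = max_u(ii, u->cross);
    /* All non-.M instructions share the three units .L, .S, and .D. */
    ii = max_u(ii, ceil_div(u->s + u->d + u->flexible, 3));
    return ii;
}

/* The loop can go no faster than its more heavily loaded side. */
static unsigned min_iteration_interval(const struct side_usage *a,
                                       const struct side_usage *b)
{
    return max_u(side_min_ii(a), side_min_ii(b));
}
```

With the A side entered as {4, 0, 4, 4, 4} and the B side as {4, 1, 4, 5, 4}, the result is 4, which agrees with the minimum iteration interval stated above.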


5.11.8 Final Assembly


Example 5-50 shows the final assembly to the FIR filter with redundant load
elimination and no memory hits. At the end of the inner loop, there is a branch
to OUTLOOP to execute the next outer loop. The outer loop counter is set to
50 because iterations j and j+1 are executing each time the inner loop is run.
The inner loop counter is set to 8 because iterations i, i+1, i+2, and i+3 are
executing each inner loop iteration.


5.11.9 Comparing Performance


The cycle count for this nested loop is 2402 cycles. There is a rather large
outer-loop overhead for executing the branch to the outer loop (6 cycles) and
the inner loop prolog (10 cycles). Section 5.12 addresses how to reduce this
overhead by software pipelining the outer loop.


Table 5-19. Comparison of FIR Filter Code

Code Example                                       Cycles                      Cycle Count
Example 5-45  FIR with redundant load elimination  50 (16 x 2 + 9 + 6) + 2     2352
Example 5-50  FIR with redundant load elimination  50 (8 x 4 + 10 + 6) + 2     2402
              and no memory hits
Example 5-50. Final Assembly Code for FIR Filter With Redundant Load Elimination
and No Memory Hits

        MVK     .S1     50,A2           ; set up outer loop counter

        MVK     .S1     62,A3           ; used to rst x pointer outloop
||      MVK     .S2     64,B10          ; used to rst h pointer outloop

OUTLOOP:
        LDH     .D1     *A4++,B5        ; x0 = x[j]
||      ADD     .L2X    A4,4,B1         ; set up pointer to x[j+2]
||      ADD     .L1X    B4,2,A8         ; set up pointer to h[1]
||      MVK     .S2     8,B2            ; set up inner loop counter
|| [A2] SUB     .S1     A2,1,A2         ; decrement outer loop counter

        LDH     .D2     *B1++[2],B0     ; x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ; x1 = x[j+i+1]
||      ZERO    .L1     A9              ; zero out sum0
||      ZERO    .L2     B9              ; zero out sum1

        LDH     .D1     *A8++[2],B6     ; h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ; h0 = h[i]

        LDH     .D1     *A4++[2],A5     ; x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ; x0 = x[j+i+4]

        LDH     .D2     *B4++[2],A7     ; h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ; h3 = h[i+3]

   [B2] SUB     .S2     B2,1,B2         ; decrement loop counter
||      LDH     .D2     *B1++[2],B0     ;* x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;* x1 = x[j+i+1]

        LDH     .D1     *A8++[2],B6     ;* h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;* h0 = h[i]

        MPY     .M1X    B5,A1,A0        ; x0 * h0
||      MPY     .M2X    A0,B6,B6        ; x1 * h1
||      LDH     .D1     *A4++[2],A5     ;* x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;* x0 = x[j+i+4]

   [B2] B       .S1     LOOP            ; branch to loop
||      MPY     .M2     B0,B6,B7        ; x2 * h1
||      MPY     .M1     A0,A1,A1        ; x1 * h0
||      LDH     .D2     *B4++[2],A7     ;* h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;* h3 = h[i+3]

   [B2] SUB     .S2     B2,1,B2         ;* decrement loop counter
||      ADD     .L1     A0,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ; x3 * h3
||      MPY     .M1X    B0,A7,A5        ; x2 * h2
||      LDH     .D2     *B1++[2],B0     ;** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;** x1 = x[j+i+1]
Example 5-50. Final Assembly Code for FIR Filter With Redundant Load Elimination
and No Memory Hits (Continued)

LOOP:
        ADD     .L2X    A1,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ; sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ; x0 * h3
||      MPY     .M1     A5,A7,A7        ; x3 * h2
|| [B2] LDH     .D1     *A8++[2],B6     ;** h1 = h[i+1]
|| [B2] LDH     .D2     *B4++[2],A1     ;** h0 = h[i]

        ADD     .L2     B7,B9,B9        ; sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ; sum0 += x2 * h2
||      MPY     .M1X    B5,A1,A0        ;* x0 * h0
||      MPY     .M2X    A0,B6,B6        ;* x1 * h1
|| [B2] LDH     .D1     *A4++[2],A5     ;** x3 = x[j+i+3]
|| [B2] LDH     .D2     *B1++[2],B5     ;** x0 = x[j+i+4]

        ADD     .L2X    A7,B9,B9        ; sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ; sum0 += x3 * h3
|| [B2] B       .S1     LOOP            ;* branch to loop
||      MPY     .M2     B0,B6,B7        ;* x2 * h1
||      MPY     .M1     A0,A1,A1        ;* x1 * h0
|| [B2] LDH     .D2     *B4++[2],A7     ;** h2 = h[i+2]
|| [B2] LDH     .D1     *A8++[2],B8     ;** h3 = h[i+3]

   [B2] SUB     .S2     B2,1,B2         ;** decrement loop counter
||      ADD     .L2     B7,B9,B9        ; sum1 += x0 * h3
||      ADD     .L1     A0,A9,A9        ;* sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ;* x3 * h3
||      MPY     .M1X    B0,A7,A5        ;* x2 * h2
|| [B2] LDH     .D2     *B1++[2],B0     ;*** x2 = x[j+i+2]
|| [B2] LDH     .D1     *A4++[2],A0     ;*** x1 = x[j+i+1]
        ; inner loop branch occurs here

   [A2] B       .S2     OUTLOOP         ; branch to outer loop
||      SUB     .L1     A4,A3,A4        ; reset x pointer to x[j]
||      SUB     .L2     B4,B10,B4       ; reset h pointer to h[0]
||      SUB     .S1     A9,A0,A9        ; sum0 -= x0*h0 (eliminate add)

        SHR     .S1     A9,15,A9        ; sum0 >> 15
||      SHR     .S2     B9,15,B9        ; sum1 >> 15

        STH     .D1     A9,*A6++        ; y[j] = sum0 >> 15

        STH     .D1     B9,*A6++        ; y[j+1] = sum1 >> 15

        NOP     2                       ; branch delay slots
        ; outer loop branch occurs here
5.12 Software Pipelining the Outer Loop

In previous examples, software pipelining has always affected the inner loop.
However, software pipelining works equally well with the outer loop in a nested
loop.


5.12.1 Unrolled FIR Filter C Code 


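Example 5-51 shows the FIR filter C code after unrolling the inner loop
(identical to Example 5-47 on page 5-95).


Example 5-51. Unrolled FIR Filter C Code

void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,h0,h1,h2,h3;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=4){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}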
5.12.2 Making the Outer Loop Parallel With the Inner Loop Epilog and Prolog


The final assembly code for the FIR filter with redundant load elimination and
no memory hits (shown in Example 5-50 on page 5-102) contained 16 cycles
of overhead to call the inner loop every time: ten cycles for the loop prolog and
six cycles for the outer loop instructions and branching to the outer loop.


Most of this overhead can be reduced as follows:


- Put the outer loop and branch instructions in parallel with the prolog.
- Create an epilog to the inner loop.
- Put some outer loop instructions in parallel with the inner-loop epilog.


5.12.3 Final Assembly 


Example 5—52 shows the final assembly for the FIR filter with a software-pipe- 
lined outer loop. Below the inner loop (starting on page 5-107), each instruction 
is marked in the comments with an e, p, or o for instructions relating to epilog, 
prolog, or outer loop, respectively. 


The inner loop is now only run seven times, because the eighth iteration is 
done in the epilog in parallel with the prolog of the next inner loop and the outer 
loop instructions. 
Example 5-52. Final Assembly Code for FIR Filter With Redundant Load Elimination and
No Memory Hits With Outer Loop Software-Pipelined

        MVK     .S1     50,A2           ; set up outer loop counter
||      STW     .D2     B11,*B15--      ; push register

        MVK     .S1     74,A3           ; used to rst x ptr outer loop
||      MVK     .S2     72,B10          ; used to rst h ptr outer loop
||      ADD     .L2X    A6,2,B11        ; set up pointer to y[1]

        LDH     .D1     *A4++,B8        ; x0 = x[j]
||      ADD     .L2X    A4,4,B1         ; set up pointer to x[j+2]
||      ADD     .L1X    B4,2,A8         ; set up pointer to h[1]
||      MVK     .S2     8,B2            ; set up inner loop counter
|| [A2] SUB     .S1     A2,1,A2         ; decrement outer loop counter

        LDH     .D2     *B1++[2],B0     ; x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ; x1 = x[j+i+1]
||      ZERO    .L1     A9              ; zero out sum0
||      ZERO    .L2     B9              ; zero out sum1

        LDH     .D1     *A8++[2],B6     ; h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ; h0 = h[i]

        LDH     .D1     *A4++[2],A5     ; x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ; x0 = x[j+i+4]

OUTLOOP:
        LDH     .D2     *B4++[2],A7     ; h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ; h3 = h[i+3]

   [B2] SUB     .S2     B2,2,B2         ; decrement loop counter
||      LDH     .D2     *B1++[2],B0     ;* x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;* x1 = x[j+i+1]

        LDH     .D1     *A8++[2],B6     ;* h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;* h0 = h[i]

        MPY     .M1X    B8,A1,A0        ; x0 * h0
||      MPY     .M2X    A0,B6,B6        ; x1 * h1
||      LDH     .D1     *A4++[2],A5     ;* x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;* x0 = x[j+i+4]

   [B2] B       .S1     LOOP            ; branch to loop
||      MPY     .M2     B0,B6,B7        ; x2 * h1
||      MPY     .M1     A0,A1,A1        ; x1 * h0
||      LDH     .D2     *B4++[2],A7     ;* h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;* h3 = h[i+3]

   [B2] SUB     .S2     B2,1,B2         ;* decrement loop counter
Example 5-52. Final Assembly Code for FIR Filter With Redundant Load Elimination and
No Memory Hits With Outer Loop Software-Pipelined (Continued)

||      ADD     .L1     A0,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ; x3 * h3
||      MPY     .M1X    B0,A7,A5        ; x2 * h2
||      LDH     .D2     *B1++[2],B0     ;** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;** x1 = x[j+i+1]

LOOP:
        ADD     .L2X    A1,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ; sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ; x0 * h3
||      MPY     .M1     A5,A7,A7        ; x3 * h2
||      LDH     .D1     *A8++[2],B6     ;** h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;** h0 = h[i]

        ADD     .L2     B7,B9,B9        ; sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ; sum0 += x2 * h2
||      MPY     .M1X    B5,A1,A0        ;* x0 * h0
||      MPY     .M2X    A0,B6,B6        ;* x1 * h1
||      LDH     .D1     *A4++[2],A5     ;** x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;** x0 = x[j+i+4]

        ADD     .L2X    A7,B9,B9        ; sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ; sum0 += x3 * h3
|| [B2] B       .S1     LOOP            ;* branch to loop
||      MPY     .M2     B0,B6,B7        ;* x2 * h1
||      MPY     .M1     A0,A1,A1        ;* x1 * h0
||      LDH     .D2     *B4++[2],A7     ;** h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;** h3 = h[i+3]

   [B2] SUB     .S2     B2,1,B2         ;** decrement loop counter
||      ADD     .L2     B7,B9,B9        ; sum1 += x0 * h3
||      ADD     .L1     A0,A9,A9        ;* sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ;* x3 * h3
||      MPY     .M1X    B0,A7,A5        ;* x2 * h2
||      LDH     .D2     *B1++[2],B0     ;*** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;*** x1 = x[j+i+1]
        ; inner loop branch occurs here

        ADD     .L2X    A1,B9,B9        ;e sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ;e sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ;e x0 * h3
||      MPY     .M1     A5,A7,A7        ;e x3 * h2
||      SUB     .D1     A4,A3,A4        ;o reset x pointer to x[j]
||      SUB     .D2     B4,B10,B4       ;o reset h pointer to h[0]
|| [A2] B       .S1     OUTLOOP         ;o branch to outer loop
Example 5-52. Final Assembly Code for FIR Filter With Redundant Load Elimination and
No Memory Hits With Outer Loop Software-Pipelined (Continued)

        ADD     .D2     B7,B9,B9        ;e sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ;e sum0 += x2 * h2
||      LDH     .D1     *A4++,B8        ;p x0 = x[j]
||      ADD     .L2X    A4,4,B1         ;o set up pointer to x[j+2]
||      ADD     .S1X    B4,2,A8         ;o set up pointer to h[1]
||      MVK     .S2     8,B2            ;o set up inner loop counter

        ADD     .L2X    A7,B9,B9        ;e sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ;e sum0 += x3 * h3
||      LDH     .D2     *B1++[2],B0     ;p x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;p x1 = x[j+i+1]
|| [A2] SUB     .S1     A2,1,A2         ;o decrement outer loop counter

        ADD     .L2     B7,B9,B9        ;e sum1 += x0 * h3
||      SHR     .S1     A9,15,A9        ;e sum0 >> 15
||      LDH     .D1     *A8++[2],B6     ;p h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;p h0 = h[i]

        SHR     .S2     B9,15,B9        ;e sum1 >> 15
||      LDH     .D1     *A4++[2],A5     ;p x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;p x0 = x[j+i+4]

        STH     .D1     A9,*A6++[2]     ;e y[j] = sum0 >> 15
||      STH     .D2     B9,*B11++[2]    ;e y[j+1] = sum1 >> 15
||      ZERO    .S1     A9              ;o zero out sum0
||      ZERO    .S2     B9              ;o zero out sum1
        ; outer loop branch occurs here


5.12.4 Comparing Performance 


The improved cycle count for this loop is 2006 cycles: 50 ((7 x 4) + 6 + 6) + 6. The
outer-loop overhead for this loop has been reduced from 16 to 8 (6 + 6 - 4);
the -4 represents one iteration less for the inner-loop iteration (seven instead
of eight).
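
The cycle-count expressions used in these comparisons all have the same shape, which the following C sketch makes explicit. The function name and parameter names are illustrative; the numbers in the usage note come from the expressions quoted in this chapter.

```c
/* Cycles for a nested loop:
 *   outer_iters    - number of outer loop iterations
 *   inner_iters    - inner loop iterations per outer iteration
 *   ii             - iteration interval of the inner loop, in cycles
 *   outer_overhead - cycles per outer iteration outside the inner loop kernel
 *   setup          - one-time cycles outside the outer loop */
static unsigned nested_loop_cycles(unsigned outer_iters, unsigned inner_iters,
                                   unsigned ii, unsigned outer_overhead,
                                   unsigned setup)
{
    return outer_iters * (inner_iters * ii + outer_overhead) + setup;
}
```

For example, nested_loop_cycles(50, 16, 2, 9 + 6, 2) gives 2352, nested_loop_cycles(50, 8, 4, 10 + 6, 2) gives 2402, and nested_loop_cycles(50, 7, 4, 6 + 6, 6) gives 2006.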


Table 5-20. Comparison of FIR Filter Code

Code Example                                       Cycles                      Cycle Count
Example 5-45  FIR with redundant load elimination  50 (16 x 2 + 9 + 6) + 2     2352
Example 5-50  FIR with redundant load elimination  50 (8 x 4 + 10 + 6) + 2     2402
              and no memory hits
Example 5-52  FIR with redundant load elimination  50 (7 x 4 + 6 + 6) + 6      2006
              and no memory hits with outer loop
              software-pipelined
5.13 Outer Loop Conditionally Executed With Inner Loop


Software pipelining the outer loop improved the outer loop overhead in the 
previous example from 16 cycles to 8 cycles. Executing the outer loop condi- 
tionally and in parallel with the inner loop eliminates the overhead entirely. 


5.13.1 Unrolled FIR Filter C Code 


Example 5-53 shows the same unrolled FIR filter C code that is used in the
previous example.


Example 5-53. Unrolled FIR Filter C Code

void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,h0,h1,h2,h3;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=4){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
5.13.2 Translating C Code to Linear Assembly (Inner Loop)


Example 5-54 shows the linear assembly for the inner loop of the FIR filter
C code (identical to Example 5-48 on page 5-96).


Example 5-54. Linear Assembly for Unrolled FIR Inner Loop

        LDH     *x++,x1         ; x1 = x[j+i+1]
        LDH     *h++,h0         ; h0 = h[i]
        MPY     x0,h0,p00       ; x0 * h0
        MPY     x1,h0,p10       ; x1 * h0
        ADD     p00,sum0,sum0   ; sum0 += x0 * h0
        ADD     p10,sum1,sum1   ; sum1 += x1 * h0

        LDH     *x++,x2         ; x2 = x[j+i+2]
        LDH     *h++,h1         ; h1 = h[i+1]
        MPY     x1,h1,p01       ; x1 * h1
        MPY     x2,h1,p11       ; x2 * h1
        ADD     p01,sum0,sum0   ; sum0 += x1 * h1
        ADD     p11,sum1,sum1   ; sum1 += x2 * h1

        LDH     *x++,x3         ; x3 = x[j+i+3]
        LDH     *h++,h2         ; h2 = h[i+2]
        MPY     x2,h2,p02       ; x2 * h2
        MPY     x3,h2,p12       ; x3 * h2
        ADD     p02,sum0,sum0   ; sum0 += x2 * h2
        ADD     p12,sum1,sum1   ; sum1 += x3 * h2

        LDH     *x++,x0         ; x0 = x[j+i+4]
        LDH     *h++,h3         ; h3 = h[i+3]
        MPY     x3,h3,p03       ; x3 * h3
        MPY     x0,h3,p13       ; x0 * h3
        ADD     p03,sum0,sum0   ; sum0 += x3 * h3
        ADD     p13,sum1,sum1   ; sum1 += x0 * h3

[cntr]  SUB     cntr,1,cntr     ; decrement loop counter
[cntr]  B       LOOP            ; branch to loop
5.13.3 Translating C Code to Linear Assembly (Outer Loop)


Example 5-55 shows the instructions that execute all of the outer loop
functions. All of these instructions are conditional on inner loop counters. Two
different counters are needed, because they must decrement to 0 on different
iterations.


- The resetting of the x and h pointers is conditional on the pointer reset
  counter, pctr.

- The shifting and storing of the even and odd y elements are conditional on
  the store counter, sctr.


When these counters are 0, all of the instructions that are conditional on that
value execute.


- The MVK instruction resets the counters to 8 because after every eight
  iterations of the loop, a new inner loop is completed (8 x 4 elements are
  processed).

- The pointer reset counter becomes 0 first to reset the load pointers, then
  the store counter becomes 0 to shift and store the result.
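
The idea of running outer-loop work conditionally inside a single loop can be modeled in C. The sketch below is a simplification made for illustration: it computes one output at a time, uses a single counter because a C model has no pipeline stages that would require separate pointer-reset and store counters, and processes eight taps per pass. The names fir_nested, fir_flat, and the constants are hypothetical.

```c
enum { N_OUT = 100, N_TAPS = 32, TAPS_PER_PASS = 8,
       PASSES = N_TAPS / TAPS_PER_PASS };

/* Reference version with an explicit outer loop. */
static void fir_nested(const short *x, const short *h, short *y)
{
    for (int j = 0; j < N_OUT; j++) {
        int sum = 0;
        for (int i = 0; i < N_TAPS; i++)
            sum += x[j + i] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* Single flattened loop: the outer-loop work is guarded by a counter,
 * in the style of the [sctr] and [!sctr] conditions. */
static void fir_flat(const short *x, const short *h, short *y)
{
    const short *xp = x, *hp = h;
    int sctr = PASSES;
    int sum = 0;

    for (int n = 0; n < N_OUT * PASSES; n++) {
        for (int k = 0; k < TAPS_PER_PASS; k++)
            sum += xp[k] * hp[k];
        xp += TAPS_PER_PASS;
        hp += TAPS_PER_PASS;

        if (sctr)                 /* [sctr]  SUB sctr,1,sctr          */
            sctr--;
        if (!sctr) {              /* [!sctr] shift, store, and reset  */
            *y++ = (short)(sum >> 15);
            sum = 0;
            xp -= N_TAPS - 1;     /* next output starts one sample on */
            hp -= N_TAPS;         /* coefficients restart at h[0]     */
            sctr = PASSES;
        }
    }
}
```

Both functions read N_OUT + N_TAPS - 1 input samples and produce the same outputs; the flattened version simply has no separate outer-loop branch to pay for.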


Example 5-55. Linear Assembly for FIR Outer Loop

[sctr]  SUB     sctr,1,sctr     ; dec store lp cntr
[!sctr] SHR     sum07,15,y0     ; (sum0 >> 15)
[!sctr] SHR     sum17,15,y1     ; (sum1 >> 15)
[!sctr] STH     y0,*y++[2]      ; y[j] = (sum0 >> 15)
[!sctr] STH     y1,*y_1++[2]    ; y[j+1] = (sum1 >> 15)
[!sctr] MVK     4,sctr          ; reset store lp cntr

[pctr]  SUB     pctr,1,pctr     ; dec pointer reset lp cntr
[!pctr] SUB     x,rstx2,x       ; reset x ptr
[!pctr] SUB     x_1,rstx1,x_1   ; reset x_1 ptr
[!pctr] SUB     h,rsth1,h       ; reset h ptr
[!pctr] SUB     h_1,rsth2,h_1   ; reset h_1 ptr
[!pctr] MVK     4,pctr          ; reset pointer reset lp cntr


5.13.4 Unrolled FIR Filter C Code 


The total number of instructions to execute both the inner and outer loops is 
38 (26 for the inner loop and 12 for the outer loop). A 4-cycle loop is no longer 
possible. To avoid slowing down the throughput of the inner loop to reduce the 
outer-loop overhead, you must unroll the FIR filter again. 


Example 5-56 shows the C code for the FIR filter, which operates on eight 
elements every inner loop. Two outer loops are also being processed together, 
as in Example 5-53 on page 5-109. 
Example 5-56. Unrolled FIR Filter C Code

void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,x4,x5,x6,x7,h0,h1,h2,h3,h4,h5,h6,h7;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=8){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x4 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x4 * h3;
            x5 = x[j+i+5];
            h4 = h[i+4];
            sum0 += x4 * h4;
            sum1 += x5 * h4;
            x6 = x[j+i+6];
            h5 = h[i+5];
            sum0 += x5 * h5;
            sum1 += x6 * h5;
            x7 = x[j+i+7];
            h6 = h[i+6];
            sum0 += x6 * h6;
            sum1 += x7 * h6;
            x0 = x[j+i+8];
            h7 = h[i+7];
            sum0 += x7 * h7;
            sum1 += x0 * h7;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
5.13.5 Translating C Code to Linear Assembly (Inner Loop)


Example 5-57 shows the instructions that perform the inner and outer loops
of the FIR filter. These instructions reflect the following modifications:


- LDWs are used instead of LDHs to reduce the number of loads in the loop.

- The reset pointer instructions immediately follow the LDW instructions.

- The first ADD instructions for sum0 and sum1 are conditional on the same
  value as the store counter, because when sctr is 0, the end of one inner
  loop has been reached and the first ADD, which adds the previous sum07
  to p00, must not be executed.

- The first ADD for sum0 writes to the same register as the first MPY p00.
  The second ADD reads p00 and p01. At the beginning of each inner loop,
  the first ADD is not performed, so the second ADD correctly reads the
  results of the first two MPYs (p01 and p00) and adds them together. For
  other iterations of the inner loop, the first ADD executes, and the second
  ADD sums the second MPY result (p01) with the running accumulator. The
  same is true for the first and second ADDs of sum1.
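
Replacing LDHs with LDWs means each register holds two halfwords, and the multiply variants select which halves to use. The C helpers below sketch that selection. They assume little-endian packing, so the element at the lower address sits in the low half of the word, which matches the comments in Example 5-57; the helper names (pack, lo16, hi16, and the lowercase multiply names) are illustrative and are not compiler intrinsics.

```c
#include <stdint.h>

/* Extract the signed low or high halfword of a 32-bit register value.
 * The narrowing conversion relies on two's-complement wraparound. */
static int16_t lo16(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
static int16_t hi16(uint32_t w) { return (int16_t)(w >> 16); }

/* Pack two consecutive halfwords as a little-endian word load would. */
static uint32_t pack(int16_t first, int16_t second)
{
    return (uint32_t)(uint16_t)first | ((uint32_t)(uint16_t)second << 16);
}

/* Models of the four multiply variants used in the linear assembly. */
static int32_t mpy(uint32_t a, uint32_t b)   { return (int32_t)lo16(a) * lo16(b); }
static int32_t mpyh(uint32_t a, uint32_t b)  { return (int32_t)hi16(a) * hi16(b); }
static int32_t mpylh(uint32_t a, uint32_t b) { return (int32_t)lo16(a) * hi16(b); }
static int32_t mpyhl(uint32_t a, uint32_t b) { return (int32_t)hi16(a) * lo16(b); }
```

With h01 = pack(h[i], h[i+1]), x01 = pack(x[j+i], x[j+i+1]), and x23 = pack(x[j+i+2], x[j+i+3]), mpylh(h01, x01) yields h[i] * x[j+i+1] and mpyhl(h01, x23) yields h[i+1] * x[j+i+2], the p10 and p11 products of the listing.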
Example 5-57. Linear Assembly for FIR With Outer Loop Conditionally Executed
With Inner Loop

        LDW     *h++[2],h01         ; h[i+0] & h[i+1]
        LDW     *h_1++[2],h23       ; h[i+2] & h[i+3]
        LDW     *h++[2],h45         ; h[i+4] & h[i+5]
        LDW     *h_1++[2],h67       ; h[i+6] & h[i+7]
        LDW     *x++[2],x01         ; x[j+i+0] & x[j+i+1]
        LDW     *x_1++[2],x23       ; x[j+i+2] & x[j+i+3]
        LDW     *x++[2],x45         ; x[j+i+4] & x[j+i+5]
        LDW     *x_1++[2],x67       ; x[j+i+6] & x[j+i+7]
        LDH     *x,x8               ; x[j+i+8]

[sctr]  SUB     sctr,1,sctr         ; dec store lp cntr
[!sctr] SHR     sum07,15,y0         ; (sum0 >> 15)
[!sctr] SHR     sum17,15,y1         ; (sum1 >> 15)
[!sctr] STH     y0,*y++[2]          ; y[j] = (sum0 >> 15)
[!sctr] STH     y1,*y_1++[2]        ; y[j+1] = (sum1 >> 15)

        MV      x01,x01b            ; move to other reg file
        MPYLH   h01,x01b,p10        ; p10 = h[i+0]*x[j+i+1]
[sctr]  ADD     p10,sum17,p10       ; sum1(p10) = p10 + sum1
        MPYHL   h01,x23,p11         ; p11 = h[i+1]*x[j+i+2]
        ADD     p11,p10,sum11       ; sum1 += p11
        MPYLH   h23,x23,p12         ; p12 = h[i+2]*x[j+i+3]
        ADD     p12,sum11,sum12     ; sum1 += p12
        MPYHL   h23,x45,p13         ; p13 = h[i+3]*x[j+i+4]
        ADD     p13,sum12,sum13     ; sum1 += p13
        MPYLH   h45,x45,p14         ; p14 = h[i+4]*x[j+i+5]
        ADD     p14,sum13,sum14     ; sum1 += p14
        MPYHL   h45,x67,p15         ; p15 = h[i+5]*x[j+i+6]
        ADD     p15,sum14,sum15     ; sum1 += p15
        MPYLH   h67,x67,p16         ; p16 = h[i+6]*x[j+i+7]
        ADD     p16,sum15,sum16     ; sum1 += p16
        MPYHL   h67,x8,p17          ; p17 = h[i+7]*x[j+i+8]
        ADD     p17,sum16,sum17     ; sum1 += p17

        MPY     h01,x01,p00         ; p00 = h[i+0]*x[j+i+0]
[sctr]  ADD     p00,sum07,p00       ; sum0(p00) = p00 + sum0
        MPYH    h01,x01,p01         ; p01 = h[i+1]*x[j+i+1]
        ADD     p01,p00,sum01       ; sum0 += p01
        MPY     h23,x23,p02         ; p02 = h[i+2]*x[j+i+2]
        ADD     p02,sum01,sum02     ; sum0 += p02
        MPYH    h23,x23,p03         ; p03 = h[i+3]*x[j+i+3]
        ADD     p03,sum02,sum03     ; sum0 += p03
        MPY     h45,x45,p04         ; p04 = h[i+4]*x[j+i+4]
        ADD     p04,sum03,sum04     ; sum0 += p04
        MPYH    h45,x45,p05         ; p05 = h[i+5]*x[j+i+5]
        ADD     p05,sum04,sum05     ; sum0 += p05
        MPY     h67,x67,p06         ; p06 = h[i+6]*x[j+i+6]
        ADD     p06,sum05,sum06     ; sum0 += p06
        MPYH    h67,x67,p07         ; p07 = h[i+7]*x[j+i+7]
        ADD     p07,sum06,sum07     ; sum0 += p07

[!sctr] MVK     4,sctr              ; reset store lp cntr
[pctr]  SUB     pctr,1,pctr         ; dec pointer reset lp cntr
[!pctr] SUB     x,rstx2,x           ; reset x ptr
[!pctr] SUB     x_1,rstx1,x_1       ; reset x_1 ptr
[!pctr] SUB     h,rsth1,h           ; reset h ptr
[!pctr] SUB     h_1,rsth2,h_1       ; reset h_1 ptr
[!pctr] MVK     4,pctr              ; reset pointer reset lp cntr
[octr]  SUB     octr,1,octr         ; dec outer lp cntr
[octr]  B       LOOP                ; branch outer loop


5.13.6 Translating C Code to Linear Assembly (Inner Loop and Outer Loop) 


Example 5-58 shows the linear assembly with functional units assigned. (As 
in Example 5-49 on page 5-98, symbolic names now have an A or B in front 
of them to signify the register file where they reside.) Although this allocation 
is one of many possibilities, one goal is to keep the 1X and 2X paths to a 
minimum. Even with this goal, you have five 2X paths and seven 1X paths. 


One requirement that was assumed when the functional units were chosen 
was that all the sum0 values reside on the same side (A in this case) and all 
the sum1 values reside on the other side (B). Because you are scheduling 
eight accumulates for both sum0 and sum1 in an 8-cycle loop, each ADD must 
be scheduled immediately following the previous ADD. Therefore, it is undesir- 
able for any sum0 ADDs to use the same functional units as sum1 ADDs. 


One MV instruction was added to get x01 on the B side for the MPYLH p10 
instruction. 
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Example 5-58. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop (With Functional Units) 


        .global _fir
_fir:   .cproc  x, h, y

        .reg    x_1, h_1, y_1, octr, pctr, sctr
        .reg    sum01, sum02, sum03, sum04, sum05, sum06, sum07
        .reg    sum11, sum12, sum13, sum14, sum15, sum16, sum17
        .reg    p00, p01, p02, p03, p04, p05, p06, p07
        .reg    p10, p11, p12, p13, p14, p15, p16, p17
        .reg    x01b, x01, x23, x45, x67, x8, h01, h23, h45, h67
        .reg    y0, y1, rstx1, rstx2, rsth1, rsth2

        ADD     x,4,x_1                 ; point to x[2]
        ADD     h,4,h_1                 ; point to h[2]
        ADD     y,2,y_1                 ; point to y[1]
        MVK     60,rstx1                ; used to rst x pointer each outer loop
        MVK     60,rstx2                ; used to rst x pointer each outer loop
        MVK     64,rsth1                ; used to rst h pointer each outer loop
        MVK     64,rsth2                ; used to rst h pointer each outer loop
        MVK     201,octr                ; loop ctr = 201 = (100/2) * (32/8) + 1
        MVK     4,pctr                  ; pointer reset lp cntr = 32/8
        MVK     5,sctr                  ; reset store lp cntr = 32/8 + 1
        ZERO    sum07                   ; sum07 = 0
        ZERO    sum17                   ; sum17 = 0

        .mptr   x, x+0
        .mptr   x_1, x+4
        .mptr   h, h+0
        .mptr   h_1, h+4

LOOP:
        LDW     .D1T1   *h++[2],h01             ; h[i+0] & h[i+1]
        LDW     .D2T2   *h_1++[2],h23           ; h[i+2] & h[i+3]
        LDW     .D1T1   *h++[2],h45             ; h[i+4] & h[i+5]
        LDW     .D2T2   *h_1++[2],h67           ; h[i+6] & h[i+7]
        LDW     .D2T1   *x++[2],x01             ; x[j+i+0] & x[j+i+1]
        LDW     .D1T2   *x_1++[2],x23           ; x[j+i+2] & x[j+i+3]
        LDW     .D2T1   *x++[2],x45             ; x[j+i+4] & x[j+i+5]
        LDW     .D1T2   *x_1++[2],x67           ; x[j+i+6] & x[j+i+7]
        LDH     .D2T1   *x,x8                   ; x[j+i+8]
[sctr]  SUB     .S1     sctr,1,sctr             ; dec store lp cntr
[!sctr] SHR     .S1     sum07,15,y0             ; (sum0 >> 15)
[!sctr] SHR     .S2     sum17,15,y1             ; (sum1 >> 15)
[!sctr] STH     .D1     y0,*y++[2]              ; y[j] = (sum0 >> 15)
[!sctr] STH     .D2     y1,*y_1++[2]            ; y[j+1] = (sum1 >> 15)
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Example 5-58. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop (With Functional Units) (Continued) 


        MV      .L2X    x01,x01b                ; move to other reg file
        MPYLH   .M2X    h01,x01b,p10            ; p10 = h[i+0]*x[j+i+1]
[sctr]  ADD     .L2     p10,sum17,p10           ; sum1(p10) = p10 + sum1

        MPYHL   .M1X    h01,x23,p11             ; p11 = h[i+1]*x[j+i+2]
        ADD     .L2X    p11,p10,sum11           ; sum1 += p11

        MPYLH   .M2     h23,x23,p12             ; p12 = h[i+2]*x[j+i+3]
        ADD     .L2     p12,sum11,sum12         ; sum1 += p12

        MPYHL   .M1X    h23,x45,p13             ; p13 = h[i+3]*x[j+i+4]
        ADD     .L2X    p13,sum12,sum13         ; sum1 += p13

        MPYLH   .M1     h45,x45,p14             ; p14 = h[i+4]*x[j+i+5]
        ADD     .L2X    p14,sum13,sum14         ; sum1 += p14

        MPYHL   .M2X    h45,x67,p15             ; p15 = h[i+5]*x[j+i+6]
        ADD     .S2     p15,sum14,sum15         ; sum1 += p15

        MPYLH   .M2     h67,x67,p16             ; p16 = h[i+6]*x[j+i+7]
        ADD     .L2     p16,sum15,sum16         ; sum1 += p16

        MPYHL   .M1X    h67,x8,p17              ; p17 = h[i+7]*x[j+i+8]
        ADD     .L2X    p17,sum16,sum17         ; sum1 += p17

        MPY     .M1     h01,x01,p00             ; p00 = h[i+0]*x[j+i+0]
[sctr]  ADD     .L1     p00,sum07,p00           ; sum0(p00) = p00 + sum0

        MPYH    .M1     h01,x01,p01             ; p01 = h[i+1]*x[j+i+1]
        ADD     .L1     p01,p00,sum01           ; sum0 += p01

        MPY     .M2     h23,x23,p02             ; p02 = h[i+2]*x[j+i+2]
        ADD     .L1X    p02,sum01,sum02         ; sum0 += p02

        MPYH    .M2     h23,x23,p03             ; p03 = h[i+3]*x[j+i+3]
        ADD     .L1X    p03,sum02,sum03         ; sum0 += p03

        MPY     .M1     h45,x45,p04             ; p04 = h[i+4]*x[j+i+4]
        ADD     .L1     p04,sum03,sum04         ; sum0 += p04

        MPYH    .M1     h45,x45,p05             ; p05 = h[i+5]*x[j+i+5]
        ADD     .L1     p05,sum04,sum05         ; sum0 += p05
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Example 5-58. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop (With Functional Units)(Continued) 


        MPY     .M2     h67,x67,p06             ; p06 = h[i+6]*x[j+i+6]
        ADD     .L1X    p06,sum05,sum06         ; sum0 += p06
        MPYH    .M2     h67,x67,p07             ; p07 = h[i+7]*x[j+i+7]
        ADD     .L1X    p07,sum06,sum07         ; sum0 += p07
[!sctr] MVK     .S1     4,sctr                  ; reset store lp cntr
[pctr]  SUB     .S1     pctr,1,pctr             ; dec pointer reset lp cntr
[!pctr] SUB     .S2     x,rstx2,x               ; reset x ptr
[!pctr] SUB     .S1     x_1,rstx1,x_1           ; reset x_1 ptr
[!pctr] SUB     .S1     h,rsth1,h               ; reset h ptr
[!pctr] SUB     .S2     h_1,rsth2,h_1           ; reset h_1 ptr
[!pctr] MVK     .S1     4,pctr                  ; reset pointer reset lp cntr
[octr]  SUB     .S2     octr,1,octr             ; dec outer lp cntr
[octr]  B       .S2     LOOP                    ; Branch outer loop
        .endproc
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5.13.7 Determining the Minimum Iteration Interval 


Based on Table 5-21, the minimum iteration interval is 8. An iteration interval 
of 8 means that two multiply-accumulates per cycle are still executing. 


Table 5-21. Resource Table for FIR Filter Code 


(a) A side                              (b) B side

Unit(s)               Total/Unit        Unit(s)               Total/Unit
.M1                   8                 .M2                   8
.S1                   7                 .S2                   6
.D1                   5                 .D2                   6
.L1                   8                 .L2                   8
Total non-.M units    20                Total non-.M units    20
1X paths              7                 2X paths              5
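
The bound of 8 can be checked with a short calculation. The following C sketch is illustrative only: the structure `side_usage` and the functions `side_bound` and `min_iteration_interval` are hypothetical names, and the rule applied (each unit issues one instruction per cycle, the non-.M total is spread over the three units .S, .D, and .L, and each cross path carries one operand per cycle) is an assumption about how the resource table is read.

```c
/* Hypothetical per-side usage counts, as listed in Table 5-21. */
struct side_usage {
    int m, s, d, l;   /* instructions assigned to the .M, .S, .D, .L units */
    int xpath;        /* instructions that read through the cross path */
};

static int max_int(int a, int b) { return a > b ? a : b; }

/* Lower bound on the iteration interval imposed by one side. */
int side_bound(struct side_usage u)
{
    int non_m = u.s + u.d + u.l;
    int bound = u.m;

    bound = max_int(bound, u.s);
    bound = max_int(bound, u.d);
    bound = max_int(bound, u.l);
    bound = max_int(bound, (non_m + 2) / 3);   /* ceil(non_m / 3) */
    bound = max_int(bound, u.xpath);
    return bound;
}

/* The loop cannot be shorter than the more constrained side allows. */
int min_iteration_interval(struct side_usage a, struct side_usage b)
{
    return max_int(side_bound(a), side_bound(b));
}
```

With the A-side counts {8, 7, 5, 8, 7} and the B-side counts {8, 6, 6, 8, 5}, the .M and .L units set the bound. Sixteen multiplies in 8 cycles is two multiply-accumulates per cycle.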


5.13.8 Final Assembly 


Example 5-59 shows the final assembly for the FIR filter with the outer loop 
conditionally executing in parallel with the inner loop. 


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Example 5-59. Final Assembly Code for FIR Filter 


MV -L1X B4,A0 ; point to h[0] & h[1] 
ADD .D2 B4,4,B2 ; point to h[2] & h[3] 
MV ~L2X A4,B1 ; point to x[j] & x[jt1] 
ADD .D1 A4,4,A4 ; point to x[j+2] & x[j+3] 
MVK 82 200,BO ; set lp ctr ((32/8)*(100/2) ) 
LDW .D1 *A4++[2],B9 ; x[Jtit2] & x[j+it3] 
LDW .D2 *B1++[2],A10 ; x[jtit0] & x[Jjtit+l] 
MVK OL 4,Al ; set pointer reset lp cntr 
LDW .D2 *B2++[2],B7 ; hlit2] & h[it+3 
LDW .D1 *AQ++[2],A8 ; h[itO] & h[itl 
MVK 28a: 60,A3 ; used to reset x ptr (16*4-4) 
MVK 82 60,B14 ; used to reset x ptr (16*4-4) 
LDW .D2 *B1++[2],A11 ; x[jtit4] & x[j+it5] 
LDW .D1 *A4++[2],B10 ; x[jtité6] & x[Jj+it+7] 
[Al] SUB edi: Al1,1,Al1 ; dec pointer reset lp cntr 
MVK ob 64,A5 ; used to reset h ptr (16*4) 
MVK 282 64,B5 ; used to reset h ptr (16*4) 
ADD .L2X A6,2,B6 ; point to y[jt1] 
LDW .D1 *AQ++[2],A9 ; hlit4] & h[it5] 
LDW .D2 *B2++[2],B8 ; h[ité] & h[it7] 
[!A1] SUB ecwe A4,A3,A4 ; reset x ptr 
{!Al1] SUB S2 B1,Bl14,Bl ; reset x ptr 
['!Al] SUB Sl A0Q,A5,A0 ; reset h ptr 
LDH D2 *B1,A8 ; x[j+its] 
ADD ~S2X A10,0,B8 7; move to other reg file 
VK oa! 5, AZ ; set store lp cntr 
PYLH .M2X A8,B8,B4 ; plO = h[it0O]*x[Jj+it+1] 
['Al] SUB ~82 B2,B5;B2 ; reset h ptr 
PYHL .M1X A8,B9,A14 ; pll = h[itl]*x[j+i+2] 
PY .M1 A8,A10,A7 ; pOO = h[it0]*x[j+it+0] 
PYLH .M2 B7,B9,B13 ; pl2 = h[it2]*x[j+it+3] 
[A2] SUB -S1 A2,1,A2 ; dec store lp cntr 
ZERO sL2 Bll ; zero out initial accumulator 
{[!A2] SHR ~S2 Bld, -157BL1 ; (Bsuml >> 15) 
MPY .M2 B7,B9,B9 ; p02 = h[it2]*x[j+i+2] 
MP YH .M1 A8,A10,A10 ; pOl = h[itl]*x[j+it+l] 
[A2] ADD ~L2 B4,Bl11,B4 7 suml(pl10) = p10 + suml 
LDW .D1 *A4++[2],B9 7* x[jtit2] & x[jt+it3] 
LDW .D2 *B1++[2],A10 7* x[jt+it0] & x[jt+itl] 
ZERO ainsi A10 yj zero out initial accumulator 
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Example 5—59. Final Assembly Code for FIR Filter (Continued) 


LOOP: 
[!A2] SHR -S1 A10,15,A12 ; (Asum0 >> 15) 
[BO] SUB oe BO,1,BO ; dec outer lp cntr 
MP YH .M2 B7,B9,B13 ; p03 = h[it3]*x[j+it+3] 
[A2] ADD ~L1 A7,A10,A7 , sum0(p00) = p0dO + sum0 
MPYHL -M1X B7,A11,A10 ; pl3 = h[it3]*x[j+it+4] 
ADD ~L2X A14,B4,B7 , suml += pll 
LDW »D2 *B2++[2],B7 pe hy LDAZ! Ge. el aS 
LDW sD *AO++[2],A8 7* h[it0O] & h[itl] 
ADD salads A10,A7,A13 ; sum0 += pol 
MPYHL ~M2X A9,B10,B12 ; plS = h[it5]*x[j+it+6] 
MPYLH -M1 A9,A11,A10 ; pl4 = h[it4]*x[j+it+5] 
ADD ~L2 B13,B7,B7 ; suml += pl2 
LDW -D2 *B1++[2],A11 7* x[jtit4] & x[jt+it5] 
LDW -D1 *A4++[2],B10 7* x[jtit6] & x[jt+it7] 
{Al] SUB -S1 A1l,1,Al1 ;* dec pointer reset lp cntr 
[BO] B aSZ LOOP ; Branch outer loop 
MPY -M1 A9,A11,A11 ; p04 = h[it4]*x[j+i+4] 
ADD ~L1X B9,A13,A13 ; sum0 += p02 
MPYLH .M2 B8,B10,B13 ; plo = h[it+6]*x[Jj+it+7] 
ADD ~L2X A10,B7,B7 ; suml += pl3 
LDW -D1 *A0++[2],A9 7;* h[it4] & h[it5] 
LDW -D2 *B2++[2],B8 7* h[it6] & h[it7] 
{!A1] SUB als A4,A3,A4 7* reset x ptr 
MPY .M2 B8,B10,Bl11 ; p06 = h[it6]*x[j+it+6] 
MP YH -M1 A9,A11,A11 ; pOS = h[it5]*x[j+it5] 
ADD .L1X B13,A13,A9 , sum0 += p03 
ADD ~L2X A10,B7,B7 , suml += pl4 
{!Al1] SUB “52 B1,B14,Bl 7* reset x ptr 
{!Al1] SUB ok AO,A5,A0 7* reset h ptr 
LDH .D2 *B1,A8 ;* x[jtit8] 
[!A2] MVK -S1 4, B2 ; reset store lp cntr 
MP YH .M2 B8,B10,B13 ; pO7 = h[it7]*x[j+it+7] 
ADD L A11,A9,A9 ; sum0 += p04 
MPYHL xX B8,A8,A9 > pl7 = Alit+7]*x[j+i+8] 
ADD S2 B12,B7,B10 ; suml += pl5 
{!A2] STH D2 Bll, *B6++[2] + y[jt1] = (Bsuml >> 15) 
{!A2] STH Dib A12,*A6++[2] ; yj] = (Asum0 >> 15) 
ADD L2X A10,0,B8 ;* move to other reg file 
ADD -L1 A11,A9,A12 , sum0 += pd5 
ADD ~L2 B13,B10,B8 , suml += pl6 
MPYLH ~M2X A8,B8,B4 7* plO = h[it0]*x[j+it1] 
{!Al] MVK SL 4,Al 7* reset pointer reset lp cntr 
{!Al1] SUB ~S2 B2,B5,B2 7* reset h ptr 
MPYHL -M1X A8,B9,A14 7* pll = h[itl]*x[j+it2] 
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Outer Loop Conditionally Executed With Inner Loop 


Example 5-59. Final Assembly Code for FIR Filter (Continued) 


ADD .L2X A9,B8,B11 ; suml += pl7 
ADD -L1X B11,A12,A12 ; sum0 += p06 
MPY .M1 A8,A10,A7 7* pOO = h[it0O]*x[j+i+0] 
MPYLH .M2 B7,B9,B13 7* pl2 = h[it2]*x[j+it+3] 
[A2] SUB 0 A2,1,A2 7* dec store lp cntr 
ADD .L1X B13,A12,A10 ; sum0 += p07 
[!A2] SHR 52 B11,15,Bl11 ;* (Bsuml >> 15) 
MPY .M2 B7,B9,B9 7* p02 = h[it2]*x[j+i+2] 
MP YH .M1 A8,A10,A10 7* pOl = h[itl]*x[j+it+1] 
[A2] ADD wdaZ B4,B11,B4 7* suml(p10) = p10 + suml 
LDW .D1 *A4++[2],B9 7** x[jtit2] & x[j+it3] 
LDW {D2 *B1++[2],A10 7** x[jt+it0] & x[jtit+l] 
;Branch occurs here 
{!A2] SHR -SL A10,15,A12 ; (Asum0 >> 15) 
{[!A2] STH .D2 B11, *B6++[2] > yljtl] = (Bsuml >> 15) 
|| [!A2] STH .D1 Al12, *A6++[2] > ylj] = (Asum0 >> 15) 


5.13.9 Comparing Performance 


The cycle count of this code is 1612: 50 (8 x 4+0) + 12. The overhead due 
to the outer loop has been completely eliminated. 


Table 5-22. Comparison of FIR Filter Code 


Code Example                                                Cycles                    Cycle Count

Example 5-42   FIR with redundant load elimination         50 (16 x 2 + 9 + 6) + 2   2352

Example 5-50   FIR with redundant load elimination and     50 (8 x 4 + 10 + 6) + 2   2402
               no memory hits

Example 5-52   FIR with redundant load elimination and     50 (7 x 4 + 6 + 6) + 6    2006
               no memory hits with outer loop
               software-pipelined

Example 5-55   FIR with redundant load elimination and     50 (8 x 4 + 0) + 12       1612
               no memory hits with outer loop
               conditionally executed with inner loop
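
The cycle counts in Table 5-22 all have the same shape: a per-outer-iteration cost repeated 50 times, plus a fixed cost outside the loops. The C sketch below reproduces the arithmetic. The function name `fir_cycles` and its parameter names are hypothetical and only label the terms of the expressions in the table.

```c
/* cycles = outer_trips * (inner_len * inner_trips + outer_overhead) + fixed
 *
 * inner_len      : cycles per inner-loop iteration
 * inner_trips    : inner-loop iterations per outer iteration
 * outer_overhead : extra cycles spent in each outer iteration
 * fixed          : cycles spent once, outside the outer loop
 */
int fir_cycles(int outer_trips, int inner_len, int inner_trips,
               int outer_overhead, int fixed)
{
    return outer_trips * (inner_len * inner_trips + outer_overhead) + fixed;
}
```

For example, `fir_cycles(50, 8, 4, 0, 12)` gives the 1612 cycles of the last row, and `fir_cycles(50, 16, 2, 9 + 6, 2)` gives the 2352 cycles of the first row.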
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This appendix provides extensive code examples from the Global Systems for 
Mobile Communications (GSM) enhanced full-rate (EFR) vocoder. The assem- 
bly code examples in this appendix represent hand-optimized code; the code 
produced by the assembly optimizer will vary, depending on the version used. 
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A.1 Summary of Major Programming Methods 




The key to implementing applications on the ’C62xx is to take advantage of the 
processor’s full speed. The main technique for achieving this goal involves 
unrolling software loops to reach the limits of the functional units while meeting 
the data dependency constraints. 


In addition to loop unrolling, the following methods are helpful for improving 
performance: 


- Rearranging the C code 

  If you are implementing a system based on existing C code, rearranging 
  the tasks in the C code is a useful method to gain better performance. 

- Avoiding memory bank hits 

  Memory bank hits, especially those in the inner loop of a nested loop 
  application, hurt performance dramatically and must be avoided. Most 
  memory bank hits, however, can be eliminated by allocating the 
  relevant arrays properly. Some situations, like accessing a word and a 
  halfword in the same cycle, can also create the chance of a memory bank 
  hit and should also be avoided. 


If the system implementation is quite complicated, the program-memory size 
becomes an issue. To achieve a good balance between program-memory size 
and speed, you can implement the less critical portions with highly-compact 
assembly code that sacrifices performance. 
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A.2 Implementation of the GSM EFR Vocoder 


This section presents the implementation of some representative pieces of code 
for the Global Systems for Mobile Communications (GSM) enhanced full-rate 
(EFR) vocoder. These include the: 


- Multiply-accumulate loop 
- Windowing and scaling part of autocorr.c 
- cor_h 
- rrv computation in search_10i40 
- Index search in search_10i40 
- FIR filter (residu.c) 
- Lag search in the lag_max() routine 


Note: 

European Telecommunications Standards Institute (ETSI) has the copyright 
to all the C code used in this section. 


The following global constants/symbols are defined in the EFR vocoder: 


#define Word16  short 
#define Word32  int 
#define MAX_32  0x7fffffffL 
#define MIN_32  0x80000000L 
#define MAX_16  0x7fff 
#define MIN_16  0x8000 
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A.2.1 Implementation of the Multiply-Accumulate Loop 


First, examine the most popular loop used in almost every fixed-point vocoder, 
the multiply-accumulate (MAC) loop, shown in Example A-1. 


Example A—1. C Code for the Typical MAC Loop 


input: 
    Word16 N;       (typical value of N is an even integer, 
                     greater than or equal to 20) 
    Word16 *x, *y; 
result: 
    Word32 sum; 
C Code 
    sum = 0; 
    for (i = 0; i < N; i++) sum = L_mac(sum, x[i], y[i]); 
where L_mac(a,b,c) = _sadd(a, _smpy(b,c)) 


Example A-2 shows a list of symbolic instructions for each iteration of the loop. 


Example A-2. Linear Assembly for the MAC Loop 


LOOP: 
        LDH     .D      *xptr++,xi      ; load x[i] 
        LDH     .D      *yptr++,yi      ; load y[i] 
        SMPY    .M      xi,yi,tmp       ; smpy(x[i],y[i]) 
        SADD    .L      sum,tmp,sum     ; sum=sadd(sum,smpy(x[i],y[i])) 
[cntr]  SUB     .S      cntr,1,cntr     ; decrement the loop counter 
[cntr]  B       .S      LOOP            ; branch to the loop 
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In Example A—2, xpir is the pointer for the x array and yptr is the pointer for the 
y array. Because there are eight functional units, these instructions can easily 
fit into one execution packet. 


In general, unrolling the loop once as in the code in Example A-3 does not give 
the same result as the code shown in Example A-1, because of the ordering 
dependence of the saturated addition. 
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Example A-—3. C Code for MAC Loop With Loop Unrolling 


Word32 sum_e, sum_o; 
sum_e = 0; 
sum_o = 0; 
for (i = 0; i < N; i += 2) { 
    sum_e = L_mac(sum_e, x[i], y[i]); 
    sum_o = L_mac(sum_o, x[i+1], y[i+1]); 
} 
sum = L_add(sum_o, sum_e); 

where L_add(a,b) = _sadd(a,b) 


However, both approaches lead to the same result if x[i] = y[i] for every i, 
because _smpy(x[i], x[i]) is always greater than or equal to 0. This special MAC 
loop is used to compute the energy of a particular signal segment. In this case, 
take the approach shown in Example A-3, because it doubles the performance 
of the code shown in Example A-2. Example A-4 shows the C code for this 
special MAC loop. Example A-5 lists the symbolic instructions for this loop. 
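
The ordering dependence of saturated addition can be seen in a small host-side model. The C sketch below is not ’C62xx code: `sadd`, `smpy`, and `L_mac` are portable stand-ins written for this illustration (assumed to match the behavior of the `_sadd` and `_smpy` intrinsics), and `mac_sequential` and `mac_unrolled` are hypothetical names for the loops of Example A-1 and Example A-3.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Portable models of saturated add and saturated fractional multiply. */
int32_t sadd(int32_t a, int32_t b) { return sat32((int64_t)a + b); }
int32_t smpy(int16_t a, int16_t b) { return sat32((int64_t)a * b * 2); }
int32_t L_mac(int32_t s, int16_t a, int16_t b) { return sadd(s, smpy(a, b)); }

/* One accumulator, products added in index order (as in Example A-1). */
int32_t mac_sequential(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum = L_mac(sum, x[i], y[i]);
    return sum;
}

/* Even and odd accumulators, combined at the end (as in Example A-3).
   n is assumed to be even. */
int32_t mac_unrolled(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum_e = 0, sum_o = 0;
    for (int i = 0; i < n; i += 2) {
        sum_e = L_mac(sum_e, x[i], y[i]);
        sum_o = L_mac(sum_o, x[i + 1], y[i + 1]);
    }
    return sadd(sum_o, sum_e);
}
```

With x = {32767, 32767, -32768, -32768} and y = {32767, 32767, 32767, 32767}, the sequential loop saturates after the second product and the two loops return different sums. With y = x every product is nonnegative, so a saturated partial sum can never come back down and both loops return the same value.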


Example A-4. C Code for Energy Computation MAC Loop 


sum = 0; 
for (i = 0; i < N; i++) 
    sum = L_mac(sum, x[i], x[i]); 
or 
sum_e = 0; 
sum_o = 0; 
for (i = 0; i < N; i += 2) { 
    sum_e = L_mac(sum_e, x[i], x[i]); 
    sum_o = L_mac(sum_o, x[i+1], x[i+1]); 
} 
sum = L_add(sum_o, sum_e); 


Example A-5. Linear Assembly for Energy Computation MAC Loop 


LOOP: 
        LDH     .D      *xptre++,xi             ; load x[i] 
        SMPY    .M      xi,xi,tmp_e             ; smpy(x[i],x[i]) 
        SADD    .L      sum_e,tmp_e,sum_e       ; sum_e=sadd(sum_e,smpy(x[i],x[i])) 
        LDH     .D      *xptro++,xi+1           ; load x[i+1] 
        SMPY    .M      xi+1,xi+1,tmp_o         ; smpy(x[i+1],x[i+1]) 
        SADD    .L      sum_o,tmp_o,sum_o       ; sum_o=sadd(sum_o,smpy(x[i+1],x[i+1])) 
[cntr]  SUB     .S      cntr,2,cntr             ; decrement the loop counter 
[cntr]  B       .S      LOOP                    ; branch to the loop 
        SADD    .L      sum_e,sum_o,sum         ; sum=sadd(sum_o,sum_e) 
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In Example A-5, xptre and xptro are the pointers for the x array and, at the be- 
ginning, point to x[0] and x[1], respectively. The eight instructions in the loop fit 
perfectly into one execution packet. This approach computes two MACs in one 
cycle. It doubles the performance of the code shown in Example A-—2 for the 
general MAC loop. 


The final assembly code is shown in Example A-6. 


Example A-6. Assembly Code for the Energy Computation MAC Loop 


********************************************************************
**                                                                **
**  Texas Instruments, Inc                                        **
**                                                                **
**  MAC Loop -- Energy Computation                                **
**                                                                **
**  Compute two samples a time                                    **
**                                                                **
**  Total cycles = (N/2+2)                                        **
**                                                                **
**  Register Usage:     A       B                                 **
**                      4       5                                 **
**                                                                **
**  Notice that x[0] and x[1] will not be available till LOOP     **
**  is executed once. Therefore, sum_e and sum_o should be 0s     **
**  for the first three iterations. This is why A5, B5, A6,       **
**  and B6 should be set to 0s in the prolog.                     **
**                                                                **
********************************************************************

; A4 -- &x[0] 
; B4 -- N 
; A6 -- sum 
        ADD     .L2X    A4,2,B4         ; &x[1] 
||      SUB     .D2     B4,6,B1         ; loop counter 
        B       .S2     LOOP            ; branch to the loop 
||      MVK     .S1     0,A6            ; initialize sum_e 
||      LDH     .D1     *A4++[2],A5     ; load x[0] 
||      LDH     .D2     *B4++[2],B5     ; load x[1] 
        B       .S2     LOOP            ; branch to the loop 
||      MV      .L2X    A6,B6           ; initialize sum_o 
||      LDH     .D1     *A4++[2],A5     ; load x[2] 
||      LDH     .D2     *B4++[2],B5     ; load x[3] 
        B       .S2     LOOP            ; branch to the loop 
||      MV      .L1     A6,A5           ; take care of the initial three iterations 
||      MV      .L2     B6,B5           ; take care of the initial three iterations 
||      LDH     .D1     *A4++[2],A5     ; load x[4] 
||      LDH     .D2     *B4++[2],B5     ; load x[5] 
        B       .S1     LOOP 
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Example A-6. Assembly Code for the Energy Computation MAC Loop (Continued) 

||      LDH     .D1     *A4++[2],A5     ; load x[6] 
||      LDH     .D2     *B4++[2],B5     ; load x[7] 

LOOP: 
        SMPY    .M1     A5,A5,A7        ; smpy(x[i],x[i]) 
||      SMPY    .M2     B5,B5,B7        ; smpy(x[i+1],x[i+1]) 
||      SADD    .L1     A7,A6,A6        ; sum_e=sadd(sum_e,smpy(x[i],x[i])) 
||      SADD    .L2     B7,B6,B6        ; sum_o=sadd(sum_o,smpy(x[i+1],x[i+1])) 
||      LDH     .D1     *A4++[2],A5     ; load x[i] 
||      LDH     .D2     *B4++[2],B5     ; load x[i+1] 
|| [B1] B       .S1     LOOP            ; branch to the loop 
|| [B1] SUB     .S2     B1,2,B1         ; decrement loop counter 
        SADD    .L1X    A6,B6,A6        ; final result, sum = sum_e + sum_o 


A.2.2. Implementation of the Windowing and Scaling Part of autocorr.c 


The autocorr.c routine is one of the most computationally intensive modules 
in the EFR vocoder. The part shown in Example A-7 windows the speech 
samples and scales down the windowed sample sequence if the input level 
is too high. Figure A-1 shows the flow diagram for this code. 
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Example A—7. C Code for the Windowing and Scaling Part of autocorr.c 


#define L_WINDOW 240 

input: 
    Word16 x[L_WINDOW], wind[L_WINDOW]; 

local variables/arrays: 
    Word16 i; 
    Word16 y[L_WINDOW]; 
    Word32 sum; 
    Word16 overfl, overfl_shft; 

Original C code: 

    /* Windowing of signal */ 
    for (i = 0; i < L_WINDOW; i++) 
    { 
        y[i] = mult_r (x[i], wind[i]); 
    } 
    /* Compute r[0] and test for overflow */ 
    overfl_shft = 0; 

    do 
    { 
        overfl = 0; 
        sum = 0L; 
        for (i = 0; i < L_WINDOW; i++) 
        { 
            sum = L_mac (sum, y[i], y[i]); 
        } 
        /* If overflow divide y[] by 4 */ 
        if (L_sub (sum, MAX_32) == 0L) 
        { 
            overfl_shft = add (overfl_shft, 4); 
            overfl = 1;           /* Set the overflow flag */ 
            for (i = 0; i < L_WINDOW; i++) 
            { 
                y[i] = shr (y[i], 2); 
            } 
        } 
    } 
    while (overfl != 0); 

Where mult_r(a,b)  = _sadd(_smpy(a,b),0x8000L)>>16 
      L_mac(a,b,c) = _sadd(a,_smpy(b,c)) 
      L_sub(a,b)   = _ssub(a,b) 
      add(a,b)     = ((_sadd((a)<<16,((b)<<16)))>>16) 
      shr(a,b)     = ((b)<0 ? (_sshl((a),(-b+16))>>16) : ((a)>>(b))) 
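
To experiment with these definitions on a host machine, the intrinsics can be modeled in portable C. The sketch below is an illustration under stated assumptions, not the ETSI or TI implementation: `sadd`, `smpy`, and `sshl` are stand-ins assumed to behave like `_sadd`, `_smpy`, and `_sshl`; the names `add16` and `shr16` are hypothetical (chosen to avoid clashing with other symbols); and `>>` on a negative value is assumed to be an arithmetic shift, which the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

int32_t sadd(int32_t a, int32_t b) { return sat32((int64_t)a + b); }
int32_t smpy(int16_t a, int16_t b) { return sat32((int64_t)a * b * 2); }

/* Saturating left shift by n bits, 0 <= n <= 31. */
int32_t sshl(int32_t a, int n) { return sat32((int64_t)a * ((int64_t)1 << n)); }

/* mult_r(a,b) = _sadd(_smpy(a,b),0x8000L)>>16 : multiply with rounding. */
int16_t mult_r(int16_t a, int16_t b)
{
    return (int16_t)(sadd(smpy(a, b), 0x8000L) >> 16);
}

/* add(a,b): 16-bit saturated add performed in the upper halfword. */
int16_t add16(int16_t a, int16_t b)
{
    return (int16_t)(sadd((int32_t)a * 65536, (int32_t)b * 65536) >> 16);
}

/* shr(a,b): right shift for b >= 0, saturating left shift for b < 0. */
int16_t shr16(int16_t a, int b)
{
    if (b < 0)
        return (int16_t)(sshl(a, -b + 16) >> 16);
    return (int16_t)(a >> b);
}
```

For instance, `mult_r(16384, 16384)` returns 8192 (0.5 x 0.5 = 0.25 in Q15), and `add16(32767, 1)` stays at 32767 instead of wrapping.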
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Figure A—1. Flow Diagram for the Windowing and Scaling Part of autocorr.c 


LOOP 1:   for (i = 0; i < L_WINDOW; i++) 
              y[i] = mult_r(x[i], wind[i]) 

LOOP 2:   for (i = 0; i < L_WINDOW; i++) 
              sum = L_mac(sum, y[i], y[i]) 

          L_sub(sum, MAX_32) == 0L ? 

          Yes: 

LOOP 3:   for (i = 0; i < L_WINDOW; i++) 
              y[i] = shr(y[i], 2) 


A.2.2.1 Unrolling the Loop 

Try the loop unrolling technique for each loop. 


Example A-8 shows the list of symbolic instructions needed to execute one 
iteration of loop 1. You can use any arithmetic logic unit (ALU) for the 
loop-counter update. 


Example A-—8. Linear Assembly for One Iteration of autocorr.c (Loop 1) 


LOOP1: 
        LDH     .D      *windptr++,windi        ; load wind[i] 
        LDH     .D      *xptr++,xi              ; load x[i] 
        SMPY    .M      windi,xi,windxi0        ; smpy(x[i],wind[i]) 
        SADD    .L      windxi0,0x8000L,windxi1 ; sadd(smpy(x[i],wind[i]),0x8000L) 
        SHR     .S      windxi1,16,yi           ; sadd(smpy(x[i],wind[i]),0x8000L)>>16 
        STH     .D      yi,*yptr++              ; store y[i] 
[cntr]  SUB     .ALU    cntr,1,cntr             ; decrement loop counter 
[cntr]  B       .S      LOOP1                   ; branch to loop 
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In Example A-8, windptr, xptr, and yptr are the pointers of wind, x, and y. 


The .D unit is used most often (three times). With properly partitioned resources, 
this is a 2-cycle loop. 


If you unroll the loop once and load both x and wind in words (in GSM EFR, 
both x and wind can be loaded in words if they are aligned on a word 
boundary), you can compute two y values in two cycles. The following is the 
new list of the instructions in one loop iteration. 


Example A-9. Linear Assembly for Loop 1 of autocorr.c (Using LDW) 


LOOP1: 
        LDW     .D      *windptr++,windi_windi+1        ; load wind[i] and wind[i+1] 
        LDW     .D      *xptr++,xi_xi+1                 ; load x[i] and x[i+1] 
        SMPY    .M      windi_windi+1,xi_xi+1,windxi0   ; smpy(x[i],wind[i]) 
        SMPYH   .M      windi_windi+1,xi_xi+1,windxi0+1 ; smpy(x[i+1],wind[i+1]) 
        SADD    .L      windxi0,0x8000L,windxi1         ; sadd(smpy(x[i],wind[i]),0x8000L) 
        SADD    .L      windxi0+1,0x8000L,windxi1+1     ; sadd(smpy(x[i+1],wind[i+1]),0x8000L) 
        SHR     .S      windxi1,16,yi                   ; sadd(smpy(x[i],wind[i]),0x8000L)>>16 
        SHR     .S      windxi1+1,16,yi+1               ; sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16 
        STH     .D      yi,*yptre++[2]                  ; store y[i] 
        STH     .D      yi+1,*yptro++[2]                ; store y[i+1] 
[cntr]  SUB     .S      cntr,2,cntr                     ; decrement loop counter 
[cntr]  B       .S      LOOP1                           ; branch to loop 


In Example A-9, yptre and yptro are the pointers for the y array and, at the 
beginning, point to y[0] and y[1], respectively. 


Note: 

Loop 2 is a special MAC loop, as described in section A.2.1 on page A-4. It 
can be implemented either as shown in Example A-10 without loop unrolling 
or as in Example A-11 with loop unrolling for one iteration. 


Example A-10. Linear Assembly for Loop 2 of autocorr.c (No Loop Unrolling) 


LOOP2: 
        LDH     .D      *yptr++,yi      ; load y[i] 
        SMPY    .M      yi,yi,yyi       ; smpy(y[i],y[i]) 
        SADD    .L      sum,yyi,sum     ; sadd(sum,smpy(y[i],y[i])) 
[cntr]  SUB     .S      cntr,1,cntr     ; decrement loop counter 
[cntr]  B       .S      LOOP2           ; branch to loop 
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Example A-11. Linear Assembly for Loop 2 of autocorr.c (With Loop Unrolling) 


LOOP2: 
        LDH     .D      *yptre++,yi             ; load y[i] 
        LDH     .D      *yptro++,yi+1           ; load y[i+1] 
        SMPY    .M      yi,yi,yyi               ; smpy(y[i],y[i]) 
        SMPY    .M      yi+1,yi+1,yyi+1         ; smpy(y[i+1],y[i+1]) 
        SADD    .L      sum_e,yyi,sum_e         ; sadd(sum_e,smpy(y[i],y[i])) 
        SADD    .L      sum_o,yyi+1,sum_o       ; sadd(sum_o,smpy(y[i+1],y[i+1])) 
[cntr]  SUB     .S      cntr,2,cntr             ; decrement loop counter 
[cntr]  B       .S      LOOP2                   ; branch to loop 

        SADD    .L      sum_e,sum_o,sum         ; sum=sum_o+sum_e 


Later, you will see that both approaches are used in this application. 


Loop 3 is a single-cycle loop and you cannot speed it up by simply unrolling 
the loop. The instructions for each iteration are shown in Example A-12. 


Example A-12. Linear Assembly for Loop 3 of autocorr.c 


LOOP3: 
        LDH     .D      *yptrl++,yi     ; load y[i] 
        SHR     .S      yi,2,yi0        ; shr(y[i],2) 
        STH     .D      yi0,*yptrs++    ; store y[i]=shr(y[i],2) 
[cntr]  SUB     .L      cntr,1,cntr     ; decrement loop counter 
[cntr]  B       .S      LOOP3           ; branch to loop 


In Example A-12, yptrl is the pointer for loading the y array and yptrs is the 
pointer for storing the y array. 


The new flow diagram is shown in Figure A-2. 
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Figure A—2. Flow Diagram for autocorr.c With Loop Unrolling 


Loop 1:   for (i = 0; i < L_WINDOW; i += 2) { 
              y[i]   = mult_r(x[i], wind[i]) 
              y[i+1] = mult_r(x[i+1], wind[i+1]) 
          } 

Loop 2:   for (i = 0; i < L_WINDOW; i += 2) { 
              sum_o = L_mac(sum_o, y[i], y[i]) 
              sum_e = L_mac(sum_e, y[i+1], y[i+1]) 
          } 
          sum = sum_o + sum_e 

          L_sub(sum, MAX_32) == 0L ? 

Loop 3:   for (i = 0; i < L_WINDOW; i++) 
              y[i] = shr(y[i], 2) 


A.2.2.2 Rearrange the C Code 

The first execution of loop 2 can be combined with loop 1 to form a new loop I, 
and its subsequent executions can be combined with loop 3 to form a new 
loop II. 


Another small change is the implementation of if L_sub(sum, MAX_32) == 0L 
as if sum == MAX_32. 
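
The two tests agree because a saturated subtraction returns 0 only when its operands are equal; saturation itself produces only the extreme values, never 0. The C sketch below illustrates this with a portable stand-in for `_ssub`. The function names `ssub`, `overflow_original`, and `overflow_simplified` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Portable model of a saturated 32-bit subtraction. */
int32_t ssub(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - b;
    if (d > INT32_MAX) return INT32_MAX;
    if (d < INT32_MIN) return INT32_MIN;
    return (int32_t)d;
}

/* Test as written in the original C code. */
int overflow_original(int32_t sum) { return ssub(sum, INT32_MAX) == 0; }

/* Simplified test used in the rearranged code. */
int overflow_simplified(int32_t sum) { return sum == INT32_MAX; }
```

The simplified form needs a single compare instead of a saturated subtract followed by a compare against zero.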


The new flow diagram with rearranged C code is shown in Figure A-3. 
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Figure A—3. Flow Diagram for autocorr.c With Rearranged C Code 


Loop I:   for (i = 0; i < L_WINDOW; i += 2) { 
              y[i]   = mult_r(x[i], wind[i]) 
              y[i+1] = mult_r(x[i+1], wind[i+1]) 
              sum_o  = L_mac(sum_o, y[i], y[i]) 
              sum_e  = L_mac(sum_e, y[i+1], y[i+1]) 
          } 
          sum = sum_o + sum_e 

Loop II:  for (i = 0; i < L_WINDOW; i++) { 
              y[i] = shr(y[i], 2) 
              sum  = L_mac(sum, y[i], y[i]) 
          } 


You can implement loop I using either of the two approaches shown in 
Example A-13. 
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Example A-—13. Linear Assembly for Loop | of autocorr.c (Modified) 


LOOPI: 
        LDW     .D      *windptr++,windi_windi+1        ; load wind[i] and wind[i+1] 
        LDW     .D      *xptr++,xi_xi+1                 ; load x[i] and x[i+1] 
        SMPY    .M      windi_windi+1,xi_xi+1,windxi0   ; smpy(x[i],wind[i]) 
        SMPYH   .M      windi_windi+1,xi_xi+1,windxi0+1 ; smpy(x[i+1],wind[i+1]) 
        SADD    .L      windxi0,0x8000L,windxi1         ; sadd(smpy(x[i],wind[i]),0x8000L) 
        SADD    .L      windxi0+1,0x8000L,windxi1+1     ; sadd(smpy(x[i+1],wind[i+1]),0x8000L) 
        SHR     .S      windxi1,16,yi                   ; sadd(smpy(x[i],wind[i]),0x8000L)>>16 
        SHR     .S      windxi1+1,16,yi+1               ; sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16 
        SMPY    .M      yi,yi,yyi                       ; smpy(y[i],y[i]) 
        SMPY    .M      yi+1,yi+1,yyi+1                 ; smpy(y[i+1],y[i+1]) 
        SADD    .L      sum_e,yyi,sum_e                 ; sum_e=sadd(sum_e,smpy(y[i],y[i])) 
        SADD    .L      sum_o,yyi+1,sum_o               ; sum_o=sadd(sum_o,smpy(y[i+1],y[i+1])) 
        STH     .D      yi,*yptre++[2]                  ; store y[i] 
        STH     .D      yi+1,*yptro++[2]                ; store y[i+1] 
[cntr]  SUB     .S      cntr,2,cntr                     ; decrement loop counter 
[cntr]  B       .S      LOOPI                           ; branch to loop 
or as 
LOOPI: 
        LDW     .D      *windptr++,windi_windi+1        ; load wind[i] and wind[i+1] 
        LDW     .D      *xptr++,xi_xi+1                 ; load x[i] and x[i+1] 
        SMPY    .M      windi_windi+1,xi_xi+1,windxi0   ; smpy(x[i],wind[i]) 
        SMPYH   .M      windi_windi+1,xi_xi+1,windxi0+1 ; smpy(x[i+1],wind[i+1]) 
        SADD    .L      windxi0,0x8000L,windxi1         ; sadd(smpy(x[i],wind[i]),0x8000L) 
        SADD    .L      windxi0+1,0x8000L,windxi1+1     ; sadd(smpy(x[i+1],wind[i+1]),0x8000L) 
        SHR     .S      windxi1,16,yi                   ; sadd(smpy(x[i],wind[i]),0x8000L)>>16 
        SHR     .S      windxi1+1,16,yi+1               ; sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16 
        SMPYH   .M      windxi1,windxi1,yyi             ; smpy(y[i],y[i]) 
        SMPYH   .M      windxi1+1,windxi1+1,yyi+1       ; smpy(y[i+1],y[i+1]) 
        SADD    .L      sum_e,yyi,sum_e                 ; sum_e=sadd(sum_e,smpy(y[i],y[i])) 
        SADD    .L      sum_o,yyi+1,sum_o               ; sum_o=sadd(sum_o,smpy(y[i+1],y[i+1])) 
        STH     .D      yi,*yptre++[2]                  ; store y[i] 
        STH     .D      yi+1,*yptro++[2]                ; store y[i+1] 
[cntr]  SUB     .S      cntr,2,cntr                     ; decrement loop counter 
[cntr]  B       .S      LOOPI                           ; branch to loop 


The only difference between these two implementations is how to compute yyi 
and yyi+1. Using yyi as an example, the former approach computes yyi 
following the order of the original C code: 


    yyi = _smpy(_sadd(_smpy(a,b),0x8000L)>>16, 
                _sadd(_smpy(a,b),0x8000L)>>16) 


The latter computes yyi in a slightly different way as: 


    yyi = _smpyh(_sadd(_smpy(a,b),0x8000L), 
                 _sadd(_smpy(a,b),0x8000L)) 
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This provides the flexibility to better pack the instructions and reduces cycle 
count. 
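
The two ways of computing yyi can be compared in a host-side model. In the C sketch below, `sadd`, `smpy`, and `smpyh` are portable stand-ins assumed to behave like `_sadd`, `_smpy`, and `_smpyh` (with `smpyh` multiplying the upper halfwords of its operands); `yyi_former` and `yyi_latter` are hypothetical names for the two expressions. Right-shifting a negative value is assumed to be an arithmetic shift.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

int32_t sadd(int32_t a, int32_t b) { return sat32((int64_t)a + b); }
int32_t smpy(int16_t a, int16_t b) { return sat32((int64_t)a * b * 2); }

/* Saturated fractional multiply of the upper halfwords of a and b. */
int32_t smpyh(int32_t a, int32_t b)
{
    return smpy((int16_t)(a >> 16), (int16_t)(b >> 16));
}

/* Former approach: form y[i] with a shift, then square it. */
int32_t yyi_former(int16_t a, int16_t b)
{
    int16_t y = (int16_t)(sadd(smpy(a, b), 0x8000L) >> 16);
    return smpy(y, y);
}

/* Latter approach: square the upper halfword of the rounded product
   directly, so the multiply does not have to wait for the shift. */
int32_t yyi_latter(int16_t a, int16_t b)
{
    int32_t t = sadd(smpy(a, b), 0x8000L);
    return smpyh(t, t);
}
```

Because the multiply in the latter form reads the rounded product directly, it no longer depends on the SHR result, which is what gives the scheduler the extra freedom.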


Loop I is a two-cycle loop. Loop II is still a single-cycle loop. Its instructions are 
shown in Example A-14. 


Example A—14. Linear Assembly for Loop II of autocorr.c (Modified) 


LOOPII: 
        LDH     .D      *yptrl++,yi     ; load y[i] 
        SHR     .S      yi,2,yi0        ; shr(y[i],2) 
        SMPY    .M      yi0,yi0,yyi     ; smpy(shr(y[i],2),shr(y[i],2)) 
        SADD    .L      sum,yyi,sum     ; sum=sadd(sum,smpy(shr(y[i],2),shr(y[i],2))) 
        STH     .D      yi0,*yptrs++    ; store y[i]=shr(y[i],2) 
[cntr]  SUB     .L      cntr,1,cntr     ; decrement loop counter 
[cntr]  B       .S      LOOPII          ; branch to loop 


A.2.2.3 Memory Bank Hits 

To schedule loop I as a 2-cycle loop: 

- x[i] + x[i+1] << 16 and wind[i] + wind[i+1] << 16 must be loaded in the 
  same cycle. 

- y[i] and y[i+1] must be stored in the other cycle. 

To avoid a memory bank hit: 

- Allocate x and wind in different memory spaces, if possible. For instance, 
  allocate wind in data ROM and x in data RAM. 

- If no data ROM is available, allocate x and wind so they are offset from 
  each other by one word. 


There is no memory bank problem when storing y[i] and y[i+1]. 


No memory bank hits occur in loop II, because the distance between the load 
and store is always six halfwords. 
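
The allocation rules above can be checked with a small address model. The C sketch below is only an illustration under an assumed memory organization: 16-bit-wide banks interleaved on halfword addresses, with the number of banks passed in as a parameter (the examples use 4; the actual count depends on the device). The function names `bank_mask` and `bank_hit` are hypothetical.

```c
#include <stdint.h>

/* Bit mask of the banks touched by an access of `size` bytes (2 or 4)
   starting at byte address `addr`, assuming halfword-interleaved banks. */
static unsigned bank_mask(uint32_t addr, unsigned size, unsigned num_banks)
{
    unsigned mask = 0;
    for (unsigned off = 0; off < size; off += 2)
        mask |= 1u << (((addr + off) / 2u) % num_banks);
    return mask;
}

/* Nonzero if two accesses issued in the same cycle share a bank. */
int bank_hit(uint32_t addr_a, unsigned size_a,
             uint32_t addr_b, unsigned size_b, unsigned num_banks)
{
    return (bank_mask(addr_a, size_a, num_banks) &
            bank_mask(addr_b, size_b, num_banks)) != 0;
}
```

Under these assumptions, two word loads whose addresses differ by one word (for example byte addresses 0 and 4) use disjoint banks, a halfword load and a halfword store six halfwords apart (byte addresses 0 and 12) also use different banks, and a word access paired with a halfword access inside that word collides.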




The modified C code of this part of autocorr.c is shown in Example A-15. 
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Example A-15. Implemented C Code for autocorr.c 


Word16 i; 
Word16 y[L_WINDOW]; 
Word32 sum, sum_e, sum_o; 
Word16 overfl, overfl_shft; 

/* Windowing of signal */ 
sum_e = sum_o = 0L; 
for (i = 0; i < L_WINDOW; i += 2) 
{ 
    y[i]   = mult_r (x[i], wind[i]); 
    y[i+1] = mult_r (x[i+1], wind[i+1]); 
    sum_e  = L_mac (sum_e, y[i], y[i]); 
    sum_o  = L_mac (sum_o, y[i+1], y[i+1]); 
} 
sum = sum_e + sum_o; 

/* Compute r[0] and test for overflow */ 
overfl_shft = 0; 

do 
{ 
    overfl = 0; 
    /* If overflow divide y[] by 4 */ 
    if (sum == MAX_32) 
    { 
        overfl_shft = add (overfl_shft, 4); 
        overfl = 1;               /* Set the overflow flag */ 
        sum = 0L; 
        for (i = 0; i < L_WINDOW; i++) 
        { 
            y[i] = shr (y[i], 2); 
            sum  = L_mac (sum, y[i], y[i]); 
        } 
    } 
} 
while (overfl != 0); 


A.2.2.4 Code-Size Reduction 


Finally, consider the code-size reduction. In Figure A-3 on page A-13, loop I is 
always executed and loop II is executed only for high input levels. This means 
that cycle count is the most important factor for loop I, while code size is more 
critical for loop II. 


A.2.2.5 Final Assembly Code 
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The final assembly code is presented in Example A-16. 


Example A-16. Assembly Code for Windowing and Scaling Part of autocorr.c 
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Implementation of The Windowing and Scaling Part of autocorr.c 


In EFR 

Compute two samples a time 

Total cycles = 257 (No Scaling) 
             = 519 (One Scaling) 


Register Usage: 


; B4 -- &x[0] 

; A4 —- &window[0] 

; A6 -- &y[0] 

; B8 -- L_WINDOW 

; AO -- sum and sum_e 
; BO -- sum_o 

; B15 -- stack pointer 


A 
11. 
; notice that we use the latter approach in Example A-13


LDW 
LDW 


SHL 
MVK 
ADD 
MV 


LDW 
LDW 
MVKLH 
MV 


-D2 
-D1 
wok 
-S2 


.L1X 
2S2 


D2 
-D1 
vol 
-L1X 


*B4++,B5 
*A4++,A5 
480,A6 
B8,6,Bl 


B15,A6,A6 
1,B7 


A6,2,B6 
A6,A3 


*B4++,B5 
*A4++,05 
32767, A10 
B7,A7 


load x[0] & x[1] 

load wind[0] & wind[1] 
reserve space for y[il] 
LOOP I counter 


&y [0] 
load x[2] & x[3] 
load wind[2] & wind[3] 


32768 or 0x8000L for rounding


load x[4] & x[5]

load wind[4] & wind[5]
7fffffff = MAX_32
32768
        SMPYH   .M2X    B5,A5,B2        ; smpy(x[1],wind[1])
        SMPY    .M1X    B5,A5,A2        ; smpy(x[0],wind[0])
        B       .S2     LOOPI
        LDW     .D2     *B4++,B5        ; load x[6] & x[7]
        LDW     .D1     *A4++,A5        ; load wind[6] & wind[7]
        MVK     .S1     0,A0            ; sum_e = 0
        MVK     .S2     0,B0            ; sum_o = 0
        SMPYH   .M2X    B5,A5,B2        ; smpy(x[3],wind[3])
        SMPY    .M1X    B5,A5,A2        ; smpy(x[2],wind[2])
        SADD    .L1     A2,A7,A2        ; sadd(smpy(x[0],wind[0]),0x8000L)
        SADD    .L2     B2,B7,B2        ; sadd(smpy(x[1],wind[1]),0x8000L)
        B       .S1     LOOPI
        LDW     .D2     *B4++,B5        ; load x[8] & x[9]
        LDW     .D1     *A4++,A5        ; load wind[8] & wind[9]
        SHR     .S1     A2,16,A9        ; y[0]=sadd(smpy(x[0],wind[0]),0x8000L)>>16
        SHR     .S2     B2,16,B9        ; y[1]=sadd(smpy(x[1],wind[1]),0x8000L)>>16
        SMPYH   .M1     A2,A2,A11       ; smpy(y[0],y[0])
        SMPYH   .M2     B2,B2,B11       ; smpy(y[1],y[1])
LOOPI:
        STH     .D1     A9,*A6++[2]     ; store y[0]
        STH     .D2     B9,*B6++[2]     ; store y[1]
        SADD    .L1     A2,A7,A2        ; sadd(smpy(x[2],wind[2]),0x8000L)
        SADD    .L2     B2,B7,B2        ; sadd(smpy(x[3],wind[3]),0x8000L)
        SMPYH   .M2X    B5,A5,B2        ; smpy(x[5],wind[5])
        SMPY    .M1X    B5,A5,A2        ; smpy(x[4],wind[4])
 [B1]   SUB     .S2     B1,2,B1         ; decrement the loop counter
 [B1]   B       .S1     LOOPI
        SADD    .L1     A0,A11,A0       ; sum_e += smpy(y[0],y[0])
        SADD    .L2     B0,B11,B0       ; sum_o += smpy(y[1],y[1])
        SMPYH   .M1     A2,A2,A11       ; smpy(y[2],y[2])
        SMPYH   .M2     B2,B2,B11       ; smpy(y[3],y[3])
        SHR     .S1     A2,16,A9        ; y[2]=sadd(smpy(x[2],wind[2]),0x8000L)>>16
        SHR     .S2     B2,16,B9        ; y[3]=sadd(smpy(x[3],wind[3]),0x8000L)>>16
        LDW     .D2     *B4++,B5        ; load x[10] & x[11]
        LDW     .D1     *A4++,A5        ; load wind[10] & wind[11]
        SADD    .L1X    A0,B0,A0        ; sum = sum_e + sum_o
        MPY     .M2     B0,0,B0         ; overfl_shift = 0
                                        ; LOOP I completed
LTEST:
        CMPEQ   .L1     A0,A10,A1
 [!A1]  B       .S1     FINISH
 [A1]   LDH     .D1     *A3,B5
 [A1]   ADD     .L2X    A3,2,B9
 [A1]   ADD     .D2     B0,4,B0
 [A1]   LDH     .D2     *B9++,B5
 [A1]   SUB     .S2     B8,7,B1
 [A1]   B       .S1     LOOPII
 [A1]   MV      .L1     A3,A9
 [A1]   LDH     .D2     *B9++,B5
 [A1]   MVK     .S1     0,A0
 [A1]   B       .S2     LOOPII
 [A1]   LDH     .D2     *B9++,B5
 [A1]   B       .S1     LOOPII
 [A1]   MV      .L1     A0,A2
 [A1]   LDH     .D2     *B9++,B5
 [A1]   B       .S1     LOOPII
 [A1]   LDH     .D2     *B9++,B5
 [A1]   SHR     .S1X    B5,2,A5
 [A1]   B       .S2     LOOPII
LOOPII:
        LDH     .D2     *B9++,B5
        SHR     .S1X    B5,2,A5
 [B1]   B       .S2     LOOPII
        STH     .D1     A5,*A9++
 [B1]   ADD     .L2     B1,-1,B1
        SMPY    .M1     A5,A5,A2
        SADD    .L1     A2,A0,A0
        STH     .D1     A5,*A9++
        SMPY    .M1     A5,A5,A2
        SADD    .L1     A2,A0,A0
        B       .S2     LTEST
        SADD    .L1     A2,A0,A0
        SADD    .L1     A2,A0,A0
        NOP     3
FINISH:


if (sum == MAX_32)
No, exit
load y[0]
&y[1]
add(overfl_shift, 4)

load y[1]
counter for LOOPII

load y[3]
to take care of the initial condition

load y[4]

load y[5]
y[0] = shr(y[0],2)

load y[6]
y[1] = shr(y[1],2)
branch
store y[0]
decrement LOOPII counter
smpy(y[0],y[0])
sum += smpy(y[i],y[i])

store y[n-1]
smpy(y[n-1],y[n-1])
sum += smpy(y[n-3],y[n-3])
branch back to LTEST

sum += smpy(y[n-2],y[n-2])

sum += smpy(y[n-1],y[n-1])

save the code size
If code size is not an issue, you can eliminate the last three NOPs by
expanding the epilog of loop II. This saves three cycles every time loop II
executes; however, code size increases by two fetch packets (2 x 32 = 64 bytes).


A.2.3 Implementation of cor_h 


The cor_h routine is the second most computationally intensive routine
called to compute the matrix of autocorrelation, rr. The core part of cor_h is
presented in Example A-17.


Example A-17. C Code for cor_h

#define L_CODE 40

input:
Word16 sign[L_CODE], h[L_CODE];

output:
Word16 rr[L_CODE][L_CODE];

local variables/arrays:
Word16 h2[L_CODE];    /* function of h, the impulse response of weighted
                         synthesis filter */
Word16 dec, j, i, k;
Word32 s;

Original C code

for (dec=1; dec<L_CODE; dec++)
{
    s = 0;
    j = L_CODE-1;
    i = sub(j, dec);
    for (k=0; k<(L_CODE-dec); k++, i--, j--)
    {
        s = L_mac(s, h2[k], h2[k+dec]);
        rr[j][i] = mult(round(s), mult(sign[i],sign[j]));
        rr[i][j] = rr[j][i];
    }
}
where sub(a,b)     = _ssub(a<<16, b<<16)>>16
      L_mac(a,b,c) = _sadd(a,_smpy(b,c))
      mult(a,b)    = _smpy(a,b)>>16
and   round(a)     = _sadd(a,0x8000L)>>16


The instructions to execute one iteration of the inner loop are listed in 
Example A-18. 
Example A-18. Linear Assembly for cor_h (One Inner Loop Iteration)

INNERLOOP:
        LDH     .D      *h2ptr++,h2k            ;load h2[k]
        LDH     .D      *h2decptr++,h2deck      ;load h2[k+dec]
        SMPY    .M      h2k,h2deck,h2kk         ;smpy(h2[k],h2[k+dec])
        SADD    .L      s,h2kk,s                ;sadd(s,smpy(h2[k],h2[k+dec]))
        SADD    .L      s,0x8000L,sround        ;round(s)<<16
        LDH     .D      *signiptr--,signi       ;load sign[i]
        LDH     .D      *signjptr--,signj       ;load sign[j]
        SMPY    .M      signi,signj,signij      ;smpy(sign[i],sign[j])=mult(sign[i],sign[j])<<16
        SMPYH   .M      signij,sround,rrji0     ;L_mult(round(s),mult(sign[i],sign[j]))
        SHR     .S      rrji0,16,rrji           ;rr[j][i]
        STH     .D      rrji,*rrjiptr--[41]     ;store rr[j][i]
        STH     .D      rrji,*rrijptr--[41]     ;store rr[i][j]
[icntr] SUB     .ALU    icntr,1,icntr           ;decrement inner loop counter
[icntr] B       .S      INNERLOOP               ;branch to inner loop


In Example A-18, h2ptr and h2decptr are the pointers for h2, pointing to h2[k] 
and h2[k+dec]. The pointers for sign, signiptr and signjptr, point to sign[i] and 
sign[j]. The pointers for rr, rrjiptr and rrijptr, point to rr[j][i] and rr[i][j], respec- 
tively. 


Notice that each element rr[j][i] is implemented as:
rr[j][i] = (_smpyh(_sadd(s, 0x8000L), _smpy(sign[i], sign[j]))) >> 16
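The single-expression form of rr[j][i] can be checked against the basic-operation form on a host machine. The sketch below is illustrative only: `sadd`, `smpy`, and `smpyh` are host emulations written for this example (assumed to behave like the saturating add and the saturating low-half and high-half multiplies described in this appendix), and `rr_etsi` and `rr_c6x` are hypothetical names.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate to the 32-bit range (saturation). */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Upper halfword as a signed value (arithmetic shift right by 16),
   written without shifting a negative number. */
static int32_t hi16(int32_t v)
{
    int64_t low = (int64_t)((uint32_t)v & 0xffffu);
    return (int32_t)(((int64_t)v - low) / 65536);
}

/* Lower halfword as a signed value. */
static int32_t lo16(int32_t v)
{
    int32_t low = (int32_t)((uint32_t)v & 0xffffu);
    return low >= 0x8000 ? low - 0x10000 : low;
}

/* Host emulations (assumed semantics) of the operations named in the text. */
static int32_t sadd(int32_t a, int32_t b)  { return sat32((int64_t)a + b); }
static int32_t smpy(int32_t a, int32_t b)  { return sat32(2 * (int64_t)lo16(a) * lo16(b)); }
static int32_t smpyh(int32_t a, int32_t b) { return sat32(2 * (int64_t)hi16(a) * hi16(b)); }

/* Reference 16-bit fractional multiply: floor(a*b / 2^15), saturated. */
static int16_t mult(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    int32_t q = (p - (int32_t)((uint32_t)p & 0x7fffu)) / 32768;
    return (int16_t)(q > 32767 ? 32767 : q);
}

/* Reference rounding: upper halfword of s + 0x8000 with saturation. */
static int16_t round16(int32_t s) { return (int16_t)hi16(sadd(s, 0x8000)); }

/* rr[j][i] written with the basic operations of Example A-17. */
int32_t rr_etsi(int32_t s, int16_t si, int16_t sj)
{
    return mult(round16(s), mult(si, sj));
}

/* rr[j][i] written as the single expression shown above. */
int32_t rr_c6x(int32_t s, int16_t si, int16_t sj)
{
    return hi16(smpyh(sadd(s, 0x8000), smpy(si, sj)));
}
```

The two forms agree because both operands of `smpyh` carry their useful value in the upper halfword, so the intermediate shifts by 16 in round() and in the inner mult() can be dropped.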


The .D unit is used most often (six times in the inner loop). Ideally, these 
instructions can be arranged in three cycles. However, memory bank hits oc- 
cur with any combination of the load and/or store instructions. 
Next, consider unrolling the inner loop once. The C code is shown in
Example A-19.


Example A-19. C Code for cor_h (With Inner Loop Unrolling)

for (dec=1; dec<L_CODE; dec++)
{
    s = 0;
    j = L_CODE-1;
    i = sub(j, dec);
    for (k=0; k<(L_CODE-dec); k+=2, i-=2, j-=2)
    {
        s = L_mac(s, h2[k], h2[k+dec]);
        rr[j][i] = mult(round(s), mult(sign[i],sign[j]));
        rr[i][j] = rr[j][i];
        s = L_mac(s, h2[k+1], h2[k+1+dec]);
        rr[j-1][i-1] = mult(round(s), mult(sign[i-1],sign[j-1]));
        rr[i-1][j-1] = rr[j-1][i-1];
    }
    if ((dec&1)!=0) {
        s = L_mac(s,h2[L_CODE-dec-1],h2[L_CODE-1]);
        rr[dec][0] = mult(round(s),mult(sign[0],sign[dec]));
        rr[0][dec] = rr[dec][0];
    }
}
Eight values must be loaded and four values must be stored in every iteration;
however, h2[k] and h2[k+1] can be loaded in a word. The same is true for
sign[j] and sign[j-1]. A total of six loads are required. The inner loop
instructions are shown in Example A-20.
Example A-20. Linear Assembly for cor_h (With Inner Loop Unrolling)

INNERLOOP:
        LDW     .D      *h2ptr++,h2k_h2k+1              ;load h2[k] and h2[k+1]
        LDH     .D      *h2decptr++,h2deck              ;load h2[k+dec]
        SMPY    .M      h2k_h2k+1,h2deck,h2kk0          ;smpy(h2[k],h2[k+dec])
        SADD    .L      s,h2kk0,s                       ;sadd(s,smpy(h2[k],h2[k+dec]))
        SADD    .L      s,0x8000L,sround                ;round(s)<<16
        LDH     .D      *signiptr--,signi               ;load sign[i]
        LDW     .D      *signjptr--,signj_signj-1       ;load sign[j] and sign[j-1]
        SMPYLH  .M      signi,signj_signj-1,signij0     ;smpy(sign[i],sign[j])
        SMPYH   .M      signij0,sround,rrji0            ;L_mult(round(s),mult(sign[i],sign[j]))
        SHR     .S      rrji0,16,rrji                   ;rr[j][i]
        STH     .D      rrji,*rrjiptr--[82]             ;store rr[j][i]
        STH     .D      rrji,*rrijptr--[82]             ;store rr[i][j]
        LDH     .D      *h2decptr++,h2deck+1            ;load h2[k+1+dec]
        SMPYHL  .M      h2k_h2k+1,h2deck+1,h2kk1        ;smpy(h2[k+1],h2[k+1+dec])
        SADD    .L      s,h2kk1,s                       ;sadd(s,smpy(h2[k+1],h2[k+1+dec]))
        SADD    .L      s,0x8000L,sround                ;round(s)<<16
        LDH     .D      *signiptr--,signi-1             ;load sign[i-1]
        SMPY    .M      signi-1,signj_signj-1,signij1   ;smpy(sign[i-1],sign[j-1])
        SMPYH   .M      signij1,sround,rrji1            ;L_mult(round(s),mult(sign[i-1],sign[j-1]))
        SHR     .S      rrji1,16,rrj1i1                 ;rr[j-1][i-1]
        STH     .D      rrj1i1,*rrj1i1ptr--[82]         ;store rr[j-1][i-1]
        STH     .D      rrj1i1,*rri1j1ptr--[82]         ;store rr[i-1][j-1]
[icntr] SUB     .ALU    icntr,2,icntr                   ;decrement inner loop counter
[icntr] B       .S      INNERLOOP                       ;branch to loop


To avoid memory bank hits:

- Load words (h2[k], h2[k+1]) and (sign[i-1], sign[i]) together and
  allocate h2 and sign so that they are aligned with each other.

- Store rr[j][i] and rr[j-1][i-1] together and rr[i][j] and rr[i-1][j-1]
  together.


There are five load/store pairs, so each iteration requires only five cycles. You
gain speed both by eliminating the memory bank hits and by reducing
the cycles required to complete each rr.


The final assembly code with reduced code size is shown in Example A-21. 
Here, the primitive technique introduced in section 5.4.3.2, Priming the Loop, 
on page 5-27 is used to reduce the code size for both the prolog and epilog 
of the inner loop. 
Example A-21. Assembly Code for cor_h With Reduced Code Size

Texas Instruments, Inc
Implementation of cor_h in EFR 
Compute four rrs at a time 
Total cycles = 2533 


Register Usage: 


16 


        SUB     .L1     A4,1,A13        ;
||      ADDK    .S1     76,A6           ;
||      ADDK    .S2     3360,B6         ;
||      SUB     .D1     A4,1,A2         ;
        MVK     .S2     0,B2            ;
||      ADD     .L2     B4,2,B13        ;
||      MVK     .S1     2,A11           ;
OUTERLOOP:
        LDW     .D1     *A6,A10         ;
        LDW     .D2     *B4,B12         ;
        ADD     .L1X    B13,2,A3        ;
        SUB     .S1     A6,A11,A4       ;
[A2]    ADD     .L2X    A2,2,B0         ;
        MPY     .M1     A13,A11,A3
        MPY     .M2     B11,0,B11       ;
        LDH     .D2     *B13++[2],A7    ;
        LDH     .D1     *A3,B7          ;
        ADD     .L2X    A4,2,B9         ;
        MV      .S2     B6,B14          ;
        SUB     .L1     A6,4,A8         ;
[B2]    ADDK    .S1     -164,A14        ;


B 


15 


A4 --- L_CODE 
B4 -—- &h2[0] 

A6 --- &sign[0] 
B6 --- &rr[0] [0] 


used to obtain érr[i][j] 
&Sign[L_CODE-2] 

&rxr [L_CODE-1] [L_CODE-2]+[82]=&rr[j] [i]+[82] 
outer loop counter 


and érr[i-1] [5-1] 


not doing the initial store 

&h2 [k+dec] 

used to increase/decrease the pointers 
for h2 and sign 


load sign[j-1] & sign[j] 
load h2[k] & h2[k+1] 
&h2[k+dect1] 

&sign[i-1] 


define the inner loop counter 
initialize s 


load h2[ 
load h2[ 
&sign[i] 
érr[j] [i]+[82] 
é&sign[j-3] 

from &rr[dec] [0]+[82] 


k+dec] 
k+dec+1] 


to &rr[dec] [0] 
B2] 
[B2] 


BO] 

[A2] 
[A2] 
[B2] 


[BO] 


NNERLOOP : 


[!B1] 


[Al] 


LDH 
LDH 
SMP YHL 
SMP YLH 
ADDK 
ADDK 


-D2 
2D1 
~L2 
ra ral 
82 
Sl 


-D1 
ol 
-L1X 
~S2 
~L2X 
-D2 


~S2 
Sl 
~L2X 
-D1 
ede 
-D2 


~S2 
-M1 
~L2X 


éDI 
~D2 
col 
-L1X 


D1 
-D2 
-M2 
-M1X 
-S2 
-L1 
~L2X 


D2 
-D1 
-M1 
-M2 
Sl 
~S2 


*B9, A0 
*B4--[2],B5 
B4,4,B8 
All,2,A11 
-82,B14 
3,Al 


Al2,*A14 
-164,A9 
B6,A14 
BO,1,B0 
B14, A3,B3 
B6,2,B6 


INNERLOOP 
A2,1,A2 
A2,1,B2 
A12,*A9 
A14,A3,A9 
BO,B1 


B9,16,B10 
A3,A0,A3 
B11,A15,B9 


*A8——, A10 
*B8++, B12 
OUTERLOOP 
B13,2,A3 


*A3,B7 
*B13++[2],A7 
B9,B5,B9 
A7,B12,A7 
B1,1,B1 
Al,1,Al 
B4,2,B9 


*B9,A0 
*R4--[2],B5 
A10,A0,A0 
B7,B12,B7 
-164,A14 
-164,B14 


load sign[i] 

load sign[i-1] 

&h2[k+2] 

update All 

érr[j] [i] 

determine when the stores in the inner loop 
actually starts 


store rr[dec] [0] 

from &rr[0] [dec]+[82] 
é&rr[j] [1]+[82] 

inner loop counter 
grr [i-1] [4-1] 
é&rr[j] [i-1], 
outer loop iteration 


tp &rr[0] [dec] 


for the next 


decrement outer loop counter 

decide if the last store is needed 
store rr[0] [dec] 

érr[i] [j]+[82] 

counter for branching to outer loop 


## obtain rr[j-1] [i-1] 
# smpyh(sadd(s,0x8000L),smpy(sign[i],sign[j])) 
# sadd(s, 0x8000L) 


*load sign[j] & sign[j-1] 
*load h2[k] & h2[k+1] 
outer LOOP 


&h2 [k+dect1] 


*h2 [kt+dect1] 

*h2 [kt+dec] 

# smpyh(sadd(s,0x8000L),smpy (sign[i-1],sign[j-1]) ) 
smpy (h2[k] ,h2 [k+dec] 

decrement the counter for branching to the outer loop 
decrement the inner loop 

é&sign[i] 


*load sign[il] 

*load sign[i-1] 

depen ery eee 

smpy (h2[k+1],h2[k+1l+dec]) 

## from ceri) se ee to &rr[j] [il 

## from &rr[j-1][i-1]+[82] to &rr[j-1] [i-1] 
'A1] 
‘Al 


BO] 


'A2] 


FINISH: 


ST 
ST 


SADD 


SM 
SU 


PY 


NOP 


-D1 
-D2 
sf 
M2 
~L2 
ope 
S52 


Al2,*A14 ; ## store rr[j] [i 
B10, *B14 ; ## store rr[j-1] [i-1 

xX B11,A7,A5 7; S = sadd(s,smpy (h2[k],h2[k+dec] ) 

x Al10,B5,B5 ; smpy(sign[i-1],sign[j-1] 
BO, 1,B0 ; decrement inner loop counter 
—-164,A9 ; ## from &rr[i][j]+[82] to rrf[i] [jl 
-164,B3 ; ## from &rr[i-1][j-1]+[82] to &rr[i-1] [j-1] 
Al2,*A9 ; ## store rr[i][j 
B10, *B3 ; ## store rr[j-1] [i-1 
A3,16,A12 ; # obtain rr[j-1] [i-1 
A5,A15,A3 ; sadd(s,0x8000L) 

x A5,B7,B11 7 Ss = sadd(s,smpy (h2[k+1],h2[k+dec+1] 
INNERLOOP ; end of INNERLOOP 

x B4,A11,B13 ; &h2[k+dec] 
FINISH ; exit 
The value of s is represented by both B11 and A5 to avoid two .L1 or two .L2
units occurring in the same execute packet. Due to the dependence on s, as 
well as the removal of memory bank hits, it takes 20 cycles for each iteration 
of the modified C code. The pound sign (#) in the comments indicates that, 
each time the outer loop enters the inner loop, this instruction is not executed 
(or that the result of this instruction is not useful) until the number of iterations 
denoted by # has occurred. 


The code size is 11 fetch packets (352 bytes). Without applying the primitive 
technique, the code size will be at least four fetch packets more than the code 
shown in Example A-21. 


You can squeeze the instruction 
ADD .L2X B4,A11,B13 ; &h2[k+dec] 


into the inner loop to save about 1.5% of the cycle counts, with an increase in 
program memory of one fetch packet. 


A.2.4 Implementation of the rrv Computation in search_10i40


Example A-22 shows the implementation of the rrv computation in search_10i40. 


Example A-22. C Code for the rrv Computation in search_10i40

#define L_CODE 40
#define STEP 5
#define _1_16 (Word16)(32768L/16)
#define _1_8  (Word16)(32768L/8)

input:
Word16 rr[L_CODE][L_CODE];

local variables/arrays:
Word16 rrv[L_CODE];
Word16 i0,i1,i2,i3,i4,i5,i6,i7,i8,i9;    /* defined on [0, L_CODE-1] */
Word32 s;

(The values of i0, i1, i2, i3, i4, i5, i6, and i7 were obtained before entering this loop.)

Original C code

for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
{
    s = L_mult(rr[i9][i9], _1_16);
    s = L_mac(s, rr[i0][i9], _1_8);
    s = L_mac(s, rr[i1][i9], _1_8);
    s = L_mac(s, rr[i2][i9], _1_8);
    s = L_mac(s, rr[i3][i9], _1_8);
    s = L_mac(s, rr[i4][i9], _1_8);
    s = L_mac(s, rr[i5][i9], _1_8);
    s = L_mac(s, rr[i6][i9], _1_8);
    s = L_mac(s, rr[i7][i9], _1_8);
    rrv[i9] = round(s);
}
where L_mult(a,b)  = _smpy(a,b)
      L_mac(a,b,c) = _sadd(a,_smpy(b,c))
and   round(a)     = _sadd(a,0x8000L)>>16


The instructions for one loop iteration are shown in Example A-23. 
Example A-23. Linear Assembly for the rrv Computation in search_10i40
(One Loop Iteration)

LOOP:
        LDH     .D      *rr9ptr++[205],rr99     ;load rr[i9][i9]
        SMPY    .M      rr99,_1_16,s            ;s=L_mult(rr[i9][i9],_1_16)
        LDH     .D      *rr0ptr++[5],rr09       ;load rr[i0][i9]
        SMPY    .M      rr09,_1_8,s0            ;L_mult(rr[i0][i9],_1_8)
        SADD    .L      s,s0,s                  ;s=L_mac(s,rr[i0][i9],_1_8)
        LDH     .D      *rr1ptr++[5],rr19       ;load rr[i1][i9]
        SMPY    .M      rr19,_1_8,s1            ;L_mult(rr[i1][i9],_1_8)
        SADD    .L      s,s1,s                  ;s=L_mac(s,rr[i1][i9],_1_8)
        LDH     .D      *rr2ptr++[5],rr29       ;load rr[i2][i9]
        SMPY    .M      rr29,_1_8,s2            ;L_mult(rr[i2][i9],_1_8)
        SADD    .L      s,s2,s                  ;s=L_mac(s,rr[i2][i9],_1_8)
        LDH     .D      *rr3ptr++[5],rr39       ;load rr[i3][i9]
        SMPY    .M      rr39,_1_8,s3            ;L_mult(rr[i3][i9],_1_8)
        SADD    .L      s,s3,s                  ;s=L_mac(s,rr[i3][i9],_1_8)
        LDH     .D      *rr4ptr++[5],rr49       ;load rr[i4][i9]
        SMPY    .M      rr49,_1_8,s4            ;L_mult(rr[i4][i9],_1_8)
        SADD    .L      s,s4,s                  ;s=L_mac(s,rr[i4][i9],_1_8)
        LDH     .D      *rr5ptr++[5],rr59       ;load rr[i5][i9]
        SMPY    .M      rr59,_1_8,s5            ;L_mult(rr[i5][i9],_1_8)
        SADD    .L      s,s5,s                  ;s=L_mac(s,rr[i5][i9],_1_8)
        LDH     .D      *rr6ptr++[5],rr69       ;load rr[i6][i9]
        SMPY    .M      rr69,_1_8,s6            ;L_mult(rr[i6][i9],_1_8)
        SADD    .L      s,s6,s                  ;s=L_mac(s,rr[i6][i9],_1_8)
        LDH     .D      *rr7ptr++[5],rr79       ;load rr[i7][i9]
        SMPY    .M      rr79,_1_8,s7            ;L_mult(rr[i7][i9],_1_8)
        SADD    .L      s,s7,s                  ;s=L_mac(s,rr[i7][i9],_1_8)
        SADD    .L      s,0x8000L,sround        ;round(s)
        SHR     .S      sround,16,rrv9          ;rrv[i9]
        STH     .D      rrv9,*rrv9ptr++[5]      ;store rrv[i9]
[icntr] SUB     .ALU    icntr,1,icntr           ;decrement inner loop counter
[icntr] B       .S      LOOP                    ;branch to loop
The following table shows the pointers in Example A-23 and the array elements
they point to.

Pointer       Points to
rr9ptr        rr[i9][i9]
rr0ptr        rr[i0][i9]
rr1ptr        rr[i1][i9]
rr2ptr        rr[i2][i9]
rr3ptr        rr[i3][i9]
rr4ptr        rr[i4][i9]
rr5ptr        rr[i5][i9]
rr6ptr        rr[i6][i9]
rr7ptr        rr[i7][i9]
rrv9ptr       rrv[i9]

The .D unit is used the most (ten times per iteration). Although these
instructions could be arranged in five cycles, any combination of two loads hits
the same memory bank, because any two values loaded are a multiple of 40
halfwords apart. It therefore still takes ten cycles for one rrv.
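The bank argument can be made concrete with a small address calculation. The sketch below is a model under stated assumptions, not a description of the actual memory system: it assumes halfword-wide banks interleaved on consecutive halfwords, with `NUM_BANKS` set to 4 purely for illustration (any bank count that divides 40 gives the same conclusion). `bank_of`, `rr_index`, and `same_bank` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define L_CODE    40
#define NUM_BANKS 4     /* assumption for illustration only */

/* Bank selected by a halfword index under the interleaving assumed above. */
static unsigned bank_of(uint32_t halfword_index)
{
    return halfword_index % NUM_BANKS;
}

/* Halfword index of rr[row][col] relative to &rr[0][0] (row-major layout). */
uint32_t rr_index(unsigned row, unsigned col)
{
    assert(row < L_CODE && col < L_CODE);
    return row * L_CODE + col;
}

/* Nonzero when two halfword indices fall in the same bank. */
int same_bank(uint32_t a, uint32_t b)
{
    return bank_of(a) == bank_of(b);
}
```

Under this model, rr[ix][i9] and rr[iy][i9] differ by 40 * (ix - iy) halfwords and always collide, while elements taken from columns i9 and i9+5 differ by an odd number of halfwords and never do.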




Next, consider unrolling the inner loop once. The C code is shown in
Example A-24.


Example A-24. C Code for the rrv Computation in search_10i40 (Unrolled Loop)

for (i9 = ipos[9]; i9 < L_CODE; i9 += 2*STEP)
{
    s = L_mult(rr[i9][i9], _1_16);
    S = L_mult(rr[i9+5][i9+5], _1_16);
    s = L_mac(s, rr[i0][i9], _1_8);
    S = L_mac(S, rr[i0][i9+5], _1_8);
    s = L_mac(s, rr[i1][i9], _1_8);
    S = L_mac(S, rr[i1][i9+5], _1_8);
    s = L_mac(s, rr[i2][i9], _1_8);
    S = L_mac(S, rr[i2][i9+5], _1_8);
    s = L_mac(s, rr[i3][i9], _1_8);
    S = L_mac(S, rr[i3][i9+5], _1_8);
    s = L_mac(s, rr[i4][i9], _1_8);
    S = L_mac(S, rr[i4][i9+5], _1_8);
    s = L_mac(s, rr[i5][i9], _1_8);
    S = L_mac(S, rr[i5][i9+5], _1_8);
    s = L_mac(s, rr[i6][i9], _1_8);
    S = L_mac(S, rr[i6][i9+5], _1_8);
    s = L_mac(s, rr[i7][i9], _1_8);
    S = L_mac(S, rr[i7][i9+5], _1_8);
    rrv[i9] = round(s);
    rrv[i9+5] = round(S);
}
Example A-25 shows the instructions for each iteration.

Example A-25. Linear Assembly for rrv Computation in search_10i40 (One Loop Iteration)

LOOP:
        LDH     .D      *rr9ptr++[410],rr99     ;load rr[i9][i9]
        SMPY    .M      rr99,_1_16,s            ;s=L_mult(rr[i9][i9],_1_16)
        LDH     .D      *rr95ptr++[410],rr995   ;load rr[i9+5][i9+5]
        SMPY    .M      rr995,_1_16,S           ;S=L_mult(rr[i9+5][i9+5],_1_16)
        LDH     .D      *rr0ptr++[10],rr09      ;load rr[i0][i9]
        SMPY    .M      rr09,_1_8,s0            ;L_mult(rr[i0][i9],_1_8)
        SADD    .L      s,s0,s                  ;s=L_mac(s,rr[i0][i9],_1_8)
        LDH     .D      *rr05ptr++[10],rr095    ;load rr[i0][i9+5]
        SMPY    .M      rr095,_1_8,S0           ;L_mult(rr[i0][i9+5],_1_8)
        SADD    .L      S,S0,S                  ;S=L_mac(S,rr[i0][i9+5],_1_8)
        LDH     .D      *rr1ptr++[10],rr19      ;load rr[i1][i9]
        SMPY    .M      rr19,_1_8,s1            ;L_mult(rr[i1][i9],_1_8)
        SADD    .L      s,s1,s                  ;s=L_mac(s,rr[i1][i9],_1_8)
        LDH     .D      *rr15ptr++[10],rr195    ;load rr[i1][i9+5]
        SMPY    .M      rr195,_1_8,S1           ;L_mult(rr[i1][i9+5],_1_8)
        SADD    .L      S,S1,S                  ;S=L_mac(S,rr[i1][i9+5],_1_8)
        LDH     .D      *rr2ptr++[10],rr29      ;load rr[i2][i9]
        SMPY    .M      rr29,_1_8,s2            ;L_mult(rr[i2][i9],_1_8)
        SADD    .L      s,s2,s                  ;s=L_mac(s,rr[i2][i9],_1_8)
        LDH     .D      *rr25ptr++[10],rr295    ;load rr[i2][i9+5]
        SMPY    .M      rr295,_1_8,S2           ;L_mult(rr[i2][i9+5],_1_8)
        SADD    .L      S,S2,S                  ;S=L_mac(S,rr[i2][i9+5],_1_8)
        LDH     .D      *rr3ptr++[10],rr39      ;load rr[i3][i9]
        SMPY    .M      rr39,_1_8,s3            ;L_mult(rr[i3][i9],_1_8)
        SADD    .L      s,s3,s                  ;s=L_mac(s,rr[i3][i9],_1_8)
        LDH     .D      *rr35ptr++[10],rr395    ;load rr[i3][i9+5]
        SMPY    .M      rr395,_1_8,S3           ;L_mult(rr[i3][i9+5],_1_8)
        SADD    .L      S,S3,S                  ;S=L_mac(S,rr[i3][i9+5],_1_8)
        LDH     .D      *rr4ptr++[10],rr49      ;load rr[i4][i9]
        SMPY    .M      rr49,_1_8,s4            ;L_mult(rr[i4][i9],_1_8)
        SADD    .L      s,s4,s                  ;s=L_mac(s,rr[i4][i9],_1_8)
        LDH     .D      *rr45ptr++[10],rr495    ;load rr[i4][i9+5]
        SMPY    .M      rr495,_1_8,S4           ;L_mult(rr[i4][i9+5],_1_8)
        SADD    .L      S,S4,S                  ;S=L_mac(S,rr[i4][i9+5],_1_8)
        LDH     .D      *rr5ptr++[10],rr59      ;load rr[i5][i9]
        SMPY    .M      rr59,_1_8,s5            ;L_mult(rr[i5][i9],_1_8)
        SADD    .L      s,s5,s                  ;s=L_mac(s,rr[i5][i9],_1_8)
        LDH     .D      *rr55ptr++[10],rr595    ;load rr[i5][i9+5]
        SMPY    .M      rr595,_1_8,S5           ;L_mult(rr[i5][i9+5],_1_8)
        SADD    .L      S,S5,S                  ;S=L_mac(S,rr[i5][i9+5],_1_8)
        LDH     .D      *rr6ptr++[10],rr69      ;load rr[i6][i9]
        SMPY    .M      rr69,_1_8,s6            ;L_mult(rr[i6][i9],_1_8)
        SADD    .L      s,s6,s                  ;s=L_mac(s,rr[i6][i9],_1_8)
        LDH     .D      *rr65ptr++[10],rr695    ;load rr[i6][i9+5]
        SMPY    .M      rr695,_1_8,S6           ;L_mult(rr[i6][i9+5],_1_8)
        SADD    .L      S,S6,S                  ;S=L_mac(S,rr[i6][i9+5],_1_8)
        LDH     .D      *rr7ptr++[10],rr79      ;load rr[i7][i9]
        SMPY    .M      rr79,_1_8,s7            ;L_mult(rr[i7][i9],_1_8)
        SADD    .L      s,s7,s                  ;s=L_mac(s,rr[i7][i9],_1_8)
        LDH     .D      *rr75ptr++[10],rr795    ;load rr[i7][i9+5]
        SMPY    .M      rr795,_1_8,S7           ;L_mult(rr[i7][i9+5],_1_8)
        SADD    .L      S,S7,S                  ;S=L_mac(S,rr[i7][i9+5],_1_8)
        SADD    .L      s,0x8000L,sround        ;round(s)
        SHR     .S      sround,16,rrv9          ;rrv[i9]
        STH     .D      rrv9,*rrv9ptr++[10]     ;store rrv[i9]
        SADD    .L      S,0x8000L,Sround        ;round(S)
        SHR     .S      Sround,16,rrv95         ;rrv[i9+5]
        STH     .D      rrv95,*rrv95ptr++[10]   ;store rrv[i9+5]
[icntr] SUB     .ALU    icntr,2,icntr           ;decrement inner loop counter
[icntr] B       .S      LOOP                    ;branch to loop
The following table shows the pointers in Example A-25 and the array elements
they point to.

Pointer                  Points to
rr9ptr and rr95ptr       rr[i9][i9] and rr[i9+5][i9+5]
rrxptr and rrx5ptr       rr[ix][i9] and rr[ix][i9+5] (where x = 0, 1, ..., 7)
rrv9ptr and rrv95ptr     rrv[i9] and rrv[i9+5]


Again, the .D unit is used the most (twenty times per iteration).

None of the pairs rr[ix][i9], rr[iy][i9+5] hit the same memory bank (where
ix, iy = i0, i1, ..., i7). The same is true for the pair rrv[i9], rrv[i9+5], as
well as for rr[i9][i9] and rr[i9+5][i9+5]. For ease of understanding:

- Load rr[ix][i9] and rr[ix][i9+5] together.
- Load rr[i9][i9] and rr[i9+5][i9+5] together.
- Store rrv[i9] and rrv[i9+5] together.

In this way, each iteration takes ten cycles without any memory bank hits.
Because each iteration now produces two rrv values in the ten cycles that
previously produced one, you double the speed by unrolling the loop once.
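The unrolling can be checked on a host machine by running the rolled and unrolled loops over the same data. This is a sketch only: the basic operations are host emulations written for this example, the eight indices are passed as an array `idx` instead of the separate variables i0 to i7, and the test data in `rrv_versions_agree` is synthetic. All function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define L_CODE 40
#define STEP   5
#define ONE_16 (32768 / 16)
#define ONE_8  (32768 / 8)

static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Host stand-ins for the basic operations (assumed semantics). */
static int32_t L_mult(int16_t a, int16_t b) { return sat32(2 * (int64_t)a * b); }
static int32_t L_mac(int32_t acc, int16_t a, int16_t b)
{
    return sat32((int64_t)acc + L_mult(a, b));
}
static int16_t round16(int32_t s)
{
    int64_t r = sat32((int64_t)s + 0x8000);
    int64_t low = (int64_t)((uint64_t)r & 0xffffu);
    return (int16_t)((r - low) / 65536);   /* upper halfword, sign kept */
}

/* One rrv value per iteration, as in Example A-22. */
void rrv_rolled(int16_t rr[L_CODE][L_CODE], const int idx[8], int start,
                int16_t rrv[L_CODE])
{
    for (int i9 = start; i9 < L_CODE; i9 += STEP) {
        int32_t s = L_mult(rr[i9][i9], ONE_16);
        for (int k = 0; k < 8; k++)
            s = L_mac(s, rr[idx[k]][i9], ONE_8);
        rrv[i9] = round16(s);
    }
}

/* Two rrv values per iteration, as in Example A-24. */
void rrv_unrolled(int16_t rr[L_CODE][L_CODE], const int idx[8], int start,
                  int16_t rrv[L_CODE])
{
    assert(start >= 0 && start < STEP);    /* keeps i9 + STEP inside the array */
    for (int i9 = start; i9 < L_CODE; i9 += 2 * STEP) {
        int32_t s = L_mult(rr[i9][i9], ONE_16);
        int32_t S = L_mult(rr[i9 + STEP][i9 + STEP], ONE_16);
        for (int k = 0; k < 8; k++) {
            s = L_mac(s, rr[idx[k]][i9], ONE_8);
            S = L_mac(S, rr[idx[k]][i9 + STEP], ONE_8);
        }
        rrv[i9] = round16(s);
        rrv[i9 + STEP] = round16(S);
    }
}

/* Returns 1 when both versions give identical rrv[] on synthetic data. */
int rrv_versions_agree(void)
{
    static int16_t rr[L_CODE][L_CODE];
    static const int idx[8] = {0, 6, 12, 18, 24, 30, 36, 2};
    int16_t a[L_CODE] = {0}, b[L_CODE] = {0};
    for (int r = 0; r < L_CODE; r++)
        for (int c = 0; c < L_CODE; c++)
            rr[r][c] = (int16_t)(((r * 131 + c * 17) % 4001 - 2000) * 16);
    rrv_rolled(rr, idx, 4, a);
    rrv_unrolled(rr, idx, 4, b);
    for (int i = 0; i < L_CODE; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

The two accumulators s and S are independent, so the unrolled form changes only the order in which the work is issued, not the values written to rrv.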


The final assembly code is shown in Example A—26. 


Example A-26. Assembly Code for the rrv Computation in search_10i40


***************************************************************************
**  Texas Instruments, Inc
**  Implementation of the rrv Computation in search_10i40 in EFR
**  Compute two rrvs at a time
**  Total cycles = 55
**  Register Usage:    A     B
**                     16    14
***************************************************************************
; B4  --- i0
; B5  --- i1
; B6  --- i2
; B7  --- i3
; A8  --- i4
; B9  --- i5
; A10 --- i6
; A11 --- i7
; B3  --- i9
; A15 --- &rr[0][0]
; A0  --- &rrv[0]
; B14 --- stack pointer
        MVK     .S1     410,A2          ; offset of rr[i9][i9]
||      MVK     .S2     410,B2          ; offset of rr[i9+5][i9+5]
        MVK     .S2     82,B0
        MPYU    .M2     B3,B0,B3        ; [i9][i9]
        SHL     .S1X    B3,1,A13
        SUB     .L2     B0,2,B0         ; 80
        ADD     .S2X    A15,B2,B13      ; &rr[5][5]
        MPYU    .M2     B4,B0,B4        ; [i0][0]
        ADD     .L2X    A15,10,B15      ; &rr[0][5]
        MVK     .S1     80,A1
        MPYU    .M2     B5,B0,B5        ; [i1][0]
        ADD     .L1X    B3,A15,A3       ; &rr[i9][i9]
        ADD     .L2     B3,B13,B3       ; &rr[i9+5][i9+5]
        ADD     .S1     A15,A13,A15     ; &rr[0][i9]
        ADD     .S2X    B15,A13,B15     ; &rr[0][i9+5]
        MPYU    .M1     A10,A1,A10      ; [i6][0]
        MPYU    .M2     B6,B0,B6        ; [i2][0]
        MPYU    .M1     A11,A1,A11      ; [i7][0]
        ADD     .L1X    B4,A15,A4       ; &rr[i0][i9]
        ADD     .L2     B4,B15,B4       ; &rr[i0][i9+5]
        LDH     .D1     *A3++[A2],A13   ; load rr[i9][i9]
        LDH     .D2     *B3++[B2],B13   ; load rr[i9+5][i9+5]
        ADD     .S1     A0,A13,A0       ; &rrv[i9]
        MPYU    .M2     B7,B0,B7        ; [i3][0]
        MPYU    .M1     A8,A1,A8        ; [i4][0]
        ADD     .L1X    B5,A15,A5       ; &rr[i1][i9]
        ADD     .L2     B5,B15,B5       ; &rr[i1][i9+5]
        ADD     .S1     A10,A15,A10     ; &rr[i6][i9]
        ADD     .S2X    A10,B15,B10     ; &rr[i6][i9+5]
        LDH     .D1     *A4++[10],A13   ; load rr[i0][i9]
        LDH     .D2     *B4++[10],B13   ; load rr[i0][i9+5]
        MPYU    .M2     B9,B0,B9        ; [i5][0]
        ADD     .L1X    B6,A15,A6       ; &rr[i2][i9]
        ADD     .L2     B6,B15,B6       ; &rr[i2][i9+5]
        ADD     .S1     A11,A15,A11     ; &rr[i7][i9]
        ADD     .S2X    A11,B15,B11     ; &rr[i7][i9+5]
        LDH     .D1     *A5++[10],A13   ; load rr[i1][i9]
        LDH     .D2     *B5++[10],B13   ; load rr[i1][i9+5]
        ADD     .L1X    B7,A15,A7       ; &rr[i3][i9]
        ADD     .L2     B7,B15,B7       ; &rr[i3][i9+5]
        ADD     .S1     A8,A15,A8       ; &rr[i4][i9]
        ADD     .S2X    A8,B15,B8       ; &rr[i4][i9+5]
        LDH     .D1     *A6++[10],A13   ; load rr[i2][i9]
        LDH     .D2     *B6++[10],B13   ; load rr[i2][i9+5]
        ADD     .L1X    B9,A15,A9       ; &rr[i5][i9]
        ADD     .L2     B9,B15,B9       ; &rr[i5][i9+5]
        LDH     .D1     *A7++[10],A13   ; load rr[i3][i9]
        LDH     .D2     *B7,B13         ; load rr[i3][i9+5]
        MVK     .S2     2048,B7         ; _1_16
        LDH     .D1     *A8++[10],A13   ; load rr[i4][i9]
||      LDH     .D2     *B8++[10],B13   ; load rr[i4][i9+5]
||      SMPY    .M1X    A13,B7,A12      ; s=smpy(rr[i9][i9],_1_16)
||      SMPY    .M2     B13,B7,B12      ; S=smpy(rr[i9+5][i9+5],_1_16)
||      SHL     .S2     B7,1,B7         ; _1_8
||      ADD     .L2X    A0,10,B0        ; &rrv[i9+5]
LOOP:


nAnNrH 


Cae -@). Eb et 


O Oo 


O Oo 


-D1 
-D2 


-M1X 


*A9++[10],A13 
*B9++[10],B13 
A13,B7,A15 
B13,B7,B15 


*A10++[10],A13 
*B10++[10],B13 
A13,B7,A15 
B13,B7,B15 


*A11++[10],A13 
*B11++[10],B13 
A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
3,Al 


A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
32767,A14 


A12,A15,A12 
B12,B15,B12 
A13,B7,A15 
B13,B7,B15 
*A3++[A2],A13 
*B3++[B2],B13 
Al4,1,A14 


A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
*A4++[10],A13 
*B4,B13 


A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
*A5++[10],A13 
*B5++[10],B13 


, 


, 


, 


, 
, 
, 
, 
, 


, 


p*® Toad rr[ie+5 


pe load. fr [i 
;* lead 2x Tt 


load rr[i5][i9 
load rr[i5] [i9+5] 
s0O=smpy (rr[i0] 
SO=smpy (rr[i0] 
load rr[i6] [i9 
load rr[i6é] [i9+5] 
sl=smpy (rr[il] 
Sl=smpy (rr[il] 


load rr [19 


[i7 
load rr[i7] [i9+5] 
(rr 


s2=smpy i2] 
S2=smpy (rr[i2] 
s=sadd(s,s0) 
S=sadd(S,S0) 
loop counter 


19) 721_8) 
i9+5],_1_8) 


i9],_1_8) 
i9+5],_1_8) 


i9],_1_8) 
19+5)], 1-8) 


s3=smpy (rr[i3] [i9],_1_8) 
S3=smpy (rr[i3] [i9+5],_1_8) 


s=sadd(s,s1) 
S=sadd(S,S1) 


s=sadd(s,s2) 
S=sadd(S,S2) 
s4=smpy (rr[i4 
S4=smpy (rr[i4 


;* load rr[i9] [i9] 


i9],_1_8) 
i9+5],_1_8) 


19+5] 


32768 for rounding 


s5=smpy (rr[i5] [i9],_1_8) 
S5=smpy (rr[i5] [i9+5],_1_8) 
s=sadd(s,s3) 

S=sadd(S,S3) 

* load rr[i0] [19] 

* Load. e440) (29+5] 


s6=smpy (rr[i6] [i9],_1_8) 
S6=smpy (rr[i6] [i9+5],_1_8) 


s=sadd(s,s4) 
S=sadd(S,S4 


1 
1 


) 
][i9 
][i9+5] 


] 
A13,B7,A15 
B13,B7,B15 


Al2,Al 
B12,Bl1 
*AGt++ 
*B6t+t+ 


5,A12 
5,B12 
10],A13 
10],B13 


A7,10,B7 


Al12,Al 
B12,B] 
*AT++ 


*BT7,Bl 


5,A12 
5,B12 
10],A13 
3 


SMP Y .M1X 
SMP Y M2 
SADD -L1 
SADD L2 
LDH D1 
LDH D2 
ADD S2X 
SADD L 
SADD L2 
LDH D 
LDH D2 
MVK ~S2 
[Al] B esol 
SADD aA) 
SADD L2 
LDH aL 
LDH -D2 
SMPY -M1X 
SMPY M2 
SHL ~S2 
[Al] SUB ane 
SADD -LL 
SADD ~L2X 
LDH -DL 
LDH -D2 
SMP Y -M1X 
SMPY M2 
SHR sol 
SHR -S2 
SMPY -M1X 
SMP Y M2 
LDH -DL 
LDH -D2 
SMP Y .M1X 
SMP Y M2 
SADD pe rae 
SADD ~L2 
LDH -DL 
LDH -D2 


2048,B7 
LOOP 


A12,A15,A12 
B12,B15,B12 
*A8++[10],A13 
*B8++[10],B13 
A13,B7,A12 
B13,B7,B12 
B7,1,B7 
Al,1,Al1 


A12,A14,A14 
B12,A14,B4 
*A9++[10],A13 
*B9++[10],B13 
A13,B7,A15 
B13,B7,B15 


A14,16,A14 
B4,16,B4 
A13,B7,A15 
B13,B7,B15 
*A10++[10],A13 
*B10++[10],B13 


A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
*A11++[10],A13 
*B11++[10],B13 


’ 


’ 


’ 


’ 


’ 


’ 


’ 


, 
’ 


’ 


s7=smpy (rr[i7] [i9],_1_8) 
S7=smpy (rr[i7] [i9+5],_1_8) 
s=sadd(s,s5) 

S=sadd(S,S5) 


* load rrl[ 
* load rr[ 


12] [i9] 
12] [i9+5] 


-* Load se Th 


&rxr[i3] [19+5] 


s=sadd(s,s6) 
S=sadd(S,S6) 
* load rr[i3] [i9] 
3] [19+5] 
ne? AG 

branch to the loop 


s=sadd(s,s7) 

S=sadd(S,S7) 

* load rr[i4)] [9] 
* load rr[i4] [i9+5] 

* s=smpy (rr[i9] [i9],_1_16) 

* S=smpy (rr[i9t+5] [i19+5],_1_16) 
_1_8 

decrement loop counter 


round(s 
round (S 
* load rr 
* load rr 
* s0=smpy 
* SO0=smpy 


rrv[i9] 

rrv[i9+5] 

* sl=smpy (rr[il] [i9],_1_8) 

* Sl=smpy (rr[il] [19+5],_1_8) 

* load rr[i6] [i9 
[i6 


* load rr [i9+5] 


* s2=smpy (rr[i2] [i9],_1_8) 

* S2=smpy (rr[i2] [i9+5],_1_8) 
* s=sadd(s,s0) 

* S=sadd(S,S0) 

* load rr[i7] [29 

* load rr[i7] [i9+5] 
        STH     .D1     A14,*A0++[10]   ; store rrv[i9]
||      STH     .D2     B4,*B0++[10]    ; store rrv[i9+5]
||      SMPY    .M1X    A13,B7,A15      ; * s3=smpy(rr[i3][i9],_1_8)
||      SMPY    .M2     B13,B7,B15      ; S3=smpy(rr[i3][i9+5],_1_8)
||      SADD    .L1     A12,A15,A12     ; s=sadd(s,s1)
||      SADD    .L2     B12,B15,B12     ; S=sadd(S,S1)
||      ADD     .S2X    A4,10,B4        ; &rr[i0][i9+5]
||      MVK     .S1     32767,A14       ; end of LOOP


Because of the shortage of registers:

- B7 serves as _1_16, as _1_8, and as the pointer for rr[i3][i9+5].

- A14 represents 0x8000L as well as rrv[i9].

- B4 serves both as the value of rrv[i9+5] and as the pointer to rr[i0][i9+5].

The last iteration of the loop can be expanded as the epilog of the loop to
overlap with the prolog of the code that follows this part of the code.
A.2.5 Implementation of the Index Search in search_10i40


The index search in search_10i40 is the core of search_10i40. The C code is 
shown in Example A-27. 


Example A-27. C Code for the Index Search for search_10i40

#define L_CODE 40
#define STEP 5
#define _1_16 (Word16)(32768L/16)
#define _1_8  (Word16)(32768L/8)

input:
Word16 rr[L_CODE][L_CODE], ipos[L_CODE], dn[L_CODE];
local variables/arrays:
Word16 rrv[L_CODE];
Word16 i0,i1,i2,i3,i4,i5,i6,i7,i8,i9;    /* defined on [0,L_CODE-1] */
Word16 ia,ib;
Word16 ps,ps0,ps1,ps2,sq,sq2;
Word16 alp,alp_16;
Word32 s,alp0,alp1,alp2;

(The values of i0, i1, i2, i3, i4, i5, i6, i7, ps0, and alp0 have
been obtained before entering this loop.)

Original C code

sq = -1;
alp = 1;
ps = 0;
ia = ipos[8];
ib = ipos[9];

/* initialize 10 indices for i8 loop (see i2-i3 loop) */
for (i8 = ipos[8]; i8 < L_CODE; i8 += STEP)
{
    ps1 = add(ps0, dn[i8]);

    alp1 = L_mac(alp0, rr[i8][i8], _1_128);
    alp1 = L_mac(alp1, rr[i0][i8], _1_64);
    alp1 = L_mac(alp1, rr[i1][i8], _1_64);
    alp1 = L_mac(alp1, rr[i2][i8], _1_64);
    alp1 = L_mac(alp1, rr[i3][i8], _1_64);
    alp1 = L_mac(alp1, rr[i4][i8], _1_64);
    alp1 = L_mac(alp1, rr[i5][i8], _1_64);
    alp1 = L_mac(alp1, rr[i6][i8], _1_64);
    alp1 = L_mac(alp1, rr[i7][i8], _1_64);

    /* initialize 3 indices for i9 inner loop (see i2-i3 loop) */
    for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
    {
        ps2 = add(ps1, dn[i9]);

        alp2 = L_mac(alp1, rrv[i9], _1_8);
        alp2 = L_mac(alp2, rr[i8][i9], _1_64);

        sq2 = mult(ps2, ps2);

        alp_16 = round(alp2);

        s = L_msu(L_mult(alp, sq2), sq, alp_16);

        if (s > 0)
        {
            sq = sq2;
            ps = ps2;
            alp = alp_16;
            ia = i8;
            ib = i9;
        }
    }
}
where add(a,b)     = _sadd(a<<16,b<<16)>>16
      L_mac(a,b,c) = _sadd(a,_smpy(b,c))
      mult(a,b)    = _smpy(a,b)>>16
      L_mult(a,b)  = _smpy(a,b)
      round(a)     = _sadd(a,0x8000L)>>16
and   L_msu(a,b,c) = _ssub(a,_smpy(b,c))


This is a typical example of performance being limited by data dependency
constraints. In this case, the dependency is on the values of alp and sq,
which are carried from one iteration to the next.

A.2.5.1 Rearranging the C Code


To avoid the unnecessary shifts, ps, ps1, ps2, alp, alp_16, sq, and sq2 are
implemented as int (Word32) variables. The calculations are implemented as:

Original                        Implemented as

ps1 = add (ps0, dn[i8]);        ps1 = _sadd(ps0, dn[i8]<<16);
ps2 = add (ps1, dn[i9]);        ps2 = _sadd(ps1, dn[i9]<<16);
sq2 = mult (ps2, ps2);          sq2 = _smpyh(ps2,ps2);
alp_16 = round(alp2);           alp_16 = _sadd(alp2,0x8000L);

There is no need to compute s explicitly. Instead of implementing the following
sequence:

    s = L_msu (L_mult (alp, sq2), sq, alp_16);

    if (s > 0)
    {
        sq = sq2;
        ps = ps2;
        alp = alp_16;
        ia = i8;
        ib = i9;
    }

you can do this sequence to fulfill the same task:

    if (_smpyh(alp,sq2) > _smpyh(sq,alp_16)) {
        sq = sq2;
        ps = ps2;
        alp = alp_16;
        ia = i8;
        ib = i9;
    }
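
The equivalence of the two tests can be checked on a host machine. The following is a minimal sketch, not TI code: hi16, smpy16, smpyh, and ssub32 are hypothetical portable stand-ins for taking the upper half-word and for the _smpy, _smpyh, and _ssub intrinsics, and the operands are assumed to be held in the upper 16 bits of 32-bit variables, as in the rearranged code. Because a saturating subtraction keeps the sign of the true difference, testing s > 0 gives the same decision as comparing the two products directly.

```c
#include <stdint.h>

/* Hypothetical portable stand-ins for the 'C6x intrinsics (illustrative
   assumptions, not TI's implementations). */

/* Upper 16 bits of x, computed by floor division so that the result does
   not depend on how the host compiler shifts negative values. */
int16_t hi16(int32_t x)
{
    int64_t v = x;
    return (int16_t)(v >= 0 ? v / 65536 : -((-v + 65535) / 65536));
}

/* Stand-in for _smpy: 16x16 multiply, doubled, saturated to 32 bits. */
int32_t smpy16(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT32_MAX;            /* the only operand pair that overflows */
    return (int32_t)a * b * 2;
}

/* Stand-in for _smpyh: _smpy applied to the upper halves of both operands. */
int32_t smpyh(int32_t a, int32_t b)
{
    return smpy16(hi16(a), hi16(b));
}

/* Stand-in for _ssub: saturating 32-bit subtraction. */
int32_t ssub32(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - b;
    if (d > INT32_MAX) return INT32_MAX;
    if (d < INT32_MIN) return INT32_MIN;
    return (int32_t)d;
}

/* Original form: compute s explicitly, then test its sign. */
int update_with_s(int32_t alp, int32_t sq2, int32_t sq, int32_t alp_16)
{
    int32_t s = ssub32(smpyh(alp, sq2), smpyh(sq, alp_16));
    return s > 0;
}

/* Rearranged form: compare the two products directly. */
int update_direct(int32_t alp, int32_t sq2, int32_t sq, int32_t alp_16)
{
    return smpyh(alp, sq2) > smpyh(sq, alp_16);
}
```

Both functions return nonzero when the candidate (sq2, alp_16) should replace the current best (sq, alp). The direct form saves the subtraction, which shortens the dependency chain between iterations.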
A.2.5.2 Performance Analysis 


The instructions to execute one iteration of the inner loop are shown in
Example A-28.

Example A-28. Linear Assembly for the Index Search for search_10i40 (Inner Loop)

INNERLOOP:
         LDH    .D    *dn9ptr++[5],dn9       ; load dn[i9]
         SHL    .S    dn9,16,dn9h            ; dn[i9] << 16
         SADD   .L    ps1,dn9h,ps2           ; ps2 = sadd(ps1, dn[i9] << 16)
         SMPYH  .M    ps2,ps2,sq2            ; sq2 = smpyh(ps2,ps2)
         LDH    .D    *rrvptr++[5],rrv       ; load rrv[i9]
         SMPY   .M    rrv,_1_8,tmp1          ; smpy(rrv[i9], _1_8)
         SADD   .L    alp1,tmp1,alp2         ; alp2=sadd(alp1,smpy(rrv[i9],_1_8))
         LDH    .D    *rr89ptr++,rr89        ; load rr[i8][i9]
         SMPY   .M    rr89,_1_64,tmp2        ; smpy(rr[i8][i9],_1_64)
         SADD   .L    alp2,tmp2,alp2         ; alp2=sadd(alp2,smpy(rr[i8][i9],_1_64))
         SADD   .L    alp2,0x8000L,alp_16    ; alp_16=sadd(alp2,0x8000L)
         SMPYH  .M    alp,sq2,tmp3           ; smpyh(alp,sq2)
         SMPYH  .M    sq,alp_16,tmp4         ; smpyh(sq,alp_16)
         CMPGT  .L    tmp3,tmp4,cndr         ; if (smpyh(alp,sq2) > smpyh(sq,alp_16))
[cndr]   MV     .ALU  sq2,sq
[cndr]   MV     .ALU  ps2,ps
[cndr]   MV     .ALU  alp_16,alp
[cndr]   MV     .ALU  i8,ia
[cndr]   MV     .ALU  i9,ib
[icntr]  SUB    .ALU  icntr,1,icntr
[icntr]  B      .S    INNERLOOP              ; branch to the loop


Because both sq and alp are carried over and required from one iteration to 
the next, their values should be put in registers to allow speedy retrieval. At 
least four cycles are required to compute new sq and alp values, and the 
requirement on the functional units does not exceed four execution packets. 
Therefore, the inner loop can be effected in four cycles per iteration. 
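
The estimate above combines two lower bounds: a resource bound set by how many instructions each type of functional unit must issue, and the length of the loop-carried dependency on sq and alp. The sketch below tallies the instructions of Example A-28 to show how the resource bound can be worked out. It assumes two units of each type (.L, .S, .M, and .D) and that the MV and SUB instructions can be placed on any free .L, .S, or .D unit; the function names are hypothetical.

```c
/* Ceiling of n / d for positive integers. */
int ceil_div(int n, int d)
{
    return (n + d - 1) / d;
}

int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Resource bound, in cycles, for one iteration of the inner loop in
   Example A-28. The instruction counts are read off the listing; the
   unit counts are assumptions about the target. */
int resource_bound(void)
{
    const int units = 2;   /* assumed: two each of .L, .S, .M, .D          */
    const int n_d   = 3;   /* LDH x 3                                      */
    const int n_m   = 5;   /* SMPYH x 3, SMPY x 2                          */
    const int n_l   = 5;   /* SADD x 4, CMPGT                              */
    const int n_s   = 2;   /* SHL, B                                       */
    const int n_any = 6;   /* MV x 5, SUB: assumed to fit .L, .S, or .D    */

    int bound = ceil_div(n_m, units);
    bound = max_int(bound, ceil_div(n_d, units));
    bound = max_int(bound, ceil_div(n_l, units));
    bound = max_int(bound, ceil_div(n_s, units));
    /* The flexible instructions share the .L, .S, and .D slots. */
    bound = max_int(bound, ceil_div(n_l + n_s + n_d + n_any, 3 * units));
    return bound;
}

/* The loop cannot run faster than either bound allows. */
int min_cycles_per_iteration(int dependency_cycles)
{
    return max_int(resource_bound(), dependency_cycles);
}
```

Under these assumptions the resource bound is three cycles, so the four-cycle dependency on sq and alp is the limiting factor, which agrees with the four cycles per iteration stated above.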


For the outer loop, any pair of rr[ix][i8], rr[iy][i8] (where ix, iy = i0, i1, ..., i7)
will definitely cause a memory bank hit if the two values are read together.
Therefore, load them in separate cycles, one value per cycle.


A.2.5.3 Partitioning the Registers 


The total number of registers required for this code, including the registers for
the array pointers, loop counters, intermediate results, etc., exceeds the
number of registers available. To partition the registers without losing speed,
the strategies are:

- For the inner loop, store the results of ps, ia, and ib, whose values are not
  used in this code.

- For the outer loop, store the pointers of arrays starting at rr[i5][i8],
  rr[i6][i8], and rr[i7][i8], whose values are needed last in the outer loop.

Assume that before entering this code, &dn[0], &ipos[0], &rr[0][0], &rrv[0],
i0, i1, i2, i3, i4, i5, i6, i7, ps0, and alp0 are known. Assume that the short
(Word16) integers are stored in the stack in the order i0, i1, i2, i3, i4, i5, i6, i7,
ia, and ib, and that a pointer &local_16[0], pointing to i0, is also known. The
int integers and the pointers of the rr arrays are stored in the stack in the order
ps0, ps, alp0, alp1, &rr[i5][i8], &rr[i6][i8], and &rr[i7][i8]. The pointer
&local_32[0], pointing to ps0, is known as well.
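
The stack layout just described can be written down as two sets of array indices. The sketch below is illustrative only: the enumerator names are hypothetical and do not appear in the EFR source, but the index values follow the order given above and match the accesses local_16[8], local_16[9], local_32[0], local_32[1], local_32[2], and local_32[3] used in the modified C code.

```c
/* Slots of the 16-bit stack area addressed through &local_16[0]. */
enum local16_index {
    L16_I0, L16_I1, L16_I2, L16_I3,
    L16_I4, L16_I5, L16_I6, L16_I7,
    L16_IA,                /* best i8 found so far */
    L16_IB,                /* best i9 found so far */
    L16_COUNT
};

/* Slots of the 32-bit stack area addressed through &local_32[0]. */
enum local32_index {
    L32_PS0,
    L32_PS,
    L32_ALP0,
    L32_ALP1,
    L32_PTR_RR_I5,         /* saved &rr[i5][i8] */
    L32_PTR_RR_I6,         /* saved &rr[i6][i8] */
    L32_PTR_RR_I7,         /* saved &rr[i7][i8] */
    L32_COUNT
};
```

With these names, local_32[L32_PS] is the slot that receives ps2 when a better candidate is found, and local_16[L16_IA] and local_16[L16_IB] receive i8 and i9.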


The C code is shown in Example A-29. 


Example A-29. Modified C Code for the Index Search

    sq = -1;
    alp = 1;
    local_32[1] = 0;
    local_16[8] = ipos[8];
    local_16[9] = ipos[9];

    /* initialize 10 indices for i8 loop (see i2-i3 loop) */
    for (i8 = ipos[8]; i8 < L_CODE; i8 += STEP) {

        ps1 = _sadd(local_32[0], dn[i8]<<16);

        local_32[3] = _sadd(local_32[2], _smpy(rr[i8][i8], _1_128));
        local_32[3] = _sadd(local_32[3], _smpy(rr[i0][i8], _1_64));
        local_32[3] = _sadd(local_32[3], _smpy(rr[i1][i8], _1_64));
        local_32[3] = _sadd(local_32[3], _smpy(rr[i2][i8], _1_64));
        local_32[3] = _sadd(local_32[3], _smpy(rr[i3][i8], _1_64));
        local_32[3] = _sadd(local_32[3], _smpy(rr[i4][i8], _1_64));
        local_32[3] = _sadd(local_32[3], _smpy(rr[i5][i8], _1_64));
        local_32[3] = _sadd(local_32[3], _smpy(rr[i6][i8], _1_64));
        local_32[3] = _sadd(local_32[3], _smpy(rr[i7][i8], _1_64));

        /* initialize 3 indices for i9 inner loop (see i2-i3 loop) */
        for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP) {

            ps2 = _sadd(ps1, dn[i9]<<16);
            alp2 = _sadd(local_32[3], _smpy(rrv[i9], _1_8));
            alp2 = _sadd(alp2, _smpy(rr[i8][i9], _1_64));

            sq2 = _smpyh(ps2, ps2);

            alp_16 = _sadd(alp2,0x8000L);

            if (_smpyh(alp,sq2) > _smpyh(sq,alp_16)) {
                sq = sq2;
                local_32[1] = ps2;
                alp = alp_16;
                local_16[8] = i8;
                local_16[9] = i9;
            }
        }
    }

A.2.5.4 Final Assembly Code 


The final code consists of the following steps: 


Step 1: Load i0, i1, ... i9, alp0, and ps0; and initialize sq, ia, and ib. Part 
of the code overlaps that of the last iteration of the code in section 
A.2.4 on page A-27. 


Step 2: Obtain the pointers for the arrays starting at rr[i0][i8], rr[i1][i8],
... rr[i7][i8], rr[i8][i9], rrv[i9], dn[i8], and dn[i9].


Step 3: Load rr[i0][i8], rr[i1][i8], ... rr[i7][i8] and dn[i8], compute 
the new ps1 and alp1, update the pointers, and store pointers 
&rr[i5][i8], &rr[i6][i8], and &rr[i7][i8]. 


Step 4: Load rr[i8][i9], rrv[i9], and dn[i9]. Compute alp2, ps2, alp_16,
and sq2, and perform a comparison. Update the parameters ia, ib, alp,
sq, and ps based on the comparison result. Repeat this step eight
times.


Step 5: Reload the values of ps0 and alp0, and &rr[i5][i8], &rr[i6][i8],
and &rr[i7][i8]. Verify that step 3 has been repeated eight times.
If not, go to step 3. If so, exit.


To avoid memory bank hits, arrays rr and rrv must not be aligned on the same
word or half-word boundary. The same applies to arrays rr and dn. As you can
see in the final assembly code shown in Example A-30, there are several
places where LDH (or STH) and LDW (or STW) occur in the same execution
packet. These cases fall into two categories. In the first category, both
instructions always access the same memory locations, as in this packet
from the inner loop:


           LDW   .D1   *+A6[3],A11     ; load alp1
|| [B2]    STH   .D2   B13,*+B6[9]     ; store ib=i9


In the second category, at least one of the instructions accesses a different
memory location each time it executes, as in this packet, in which a store
from the inner loop is paired with a load from the outer loop:


   [B2]    STW   .D1   B11,*+A6[1]     ; store ps
||         LDH   .D2   *B10++[5],A5    ; load rr[i5][i8]


In the former case, memory bank hits can be completely eliminated by
allocating the corresponding arrays in memory properly. In the latter case,
however, memory bank hits occur in every other iteration. Although, in general,
you should avoid writing such code, in this case the performance of the prolog
of the outer loop after the first iteration is limited by the .D unit, so
pairing the two accesses still saves some cycles in this example.


To improve the performance, the last two iterations of the inner loop overlap 
part of the prolog of the outer loop. 


Example A-30. Assembly Code for the search_10i40 Index Search

***************************************************************************
**                                                                       **
**  Texas Instruments, Inc                                               **
**                                                                       **
**  Implementation of The Index Search in search_10i40 in EFR            **
**                                                                       **
**  Total cycles = 400 (among the 400 cycles, 10 cycles are caused       **
**                      by memory bank hits)                             **
**                                                                       **
**  Register Usage:     A     B                                          **
**                     15    15                                          **
**                                                                       **
***************************************************************************

; A13 --- &ipos[0] and alp
; B6  --- &local_16[0]
; A6  --- stack pointer, point to &local_32[0]
; B8  --- &rr[0][0]
; A4  --- &rrv[0]
; B14 --- &dn[0]
; B1  --- reserved for the counter of the
;         outmost loop in search_10i40

        LDH    .D1    *+A13[8],A7      ; load i8 = ipos[8]

        LDH    .D1    *+A13[9],B13     ; load i9 = ipos[9]
        LDH    .D2    *B6,A13          ; load i0
        MV     .S1X   B6,A5            ; &local_16[0]
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LDH 
LDH 
MVK 


Ss 
< 
NN Do 


UO 
OO 


ae ae OO 
OO Ga 


-D2 
sbi 
Sl 


-L1X 


*+B6[2],B9 
*+A5[1],A14 
0,A8 


*+A5[4],A15 
*+B6[3],B10 
80,A0 
80,B0 


A8, *+A6[1] 
*4+B6[5],Bl1l 
A7,1,B10 
A7,A0,A12 


A7,*+A5[8] 
B13, *+B6[9] 
B8,B10,B2 
A13,B0,B3 


*A6,B15 
*+B6[6],Al 
A12,B2,A12 
B14,B10,B7 
B8,A12,B8 
A14,A0,A14 
B9,BO,B9 


*+A6[2],Al11 
*+B6[7],B5 
B13,B13,B12 
B3,B2,B3 


*A12,A5 
*B7++[5],B12 
B14,B12,B14 
A14,B2,A14 


*B3++[5],A5 
B9,B2,B9 
B10,A0,A9 


*A144+4+[5],A5 
A4,B12,A4 
A15,A0,A15 
B11,B0,Bll 


load i2 
load il 


could insert two .D 
units here for the store 


of rrv[i9+30] 


in the code which this piece 


and rrv[i9+35] 


immediately follows 


load i4 
load i3 


ps=0 

load i5 
[0] [i8] 
[i8] [0] 


store ia=i8 
store ib=i9 


&rr [0] [i8] 
[i0] [0] 
load ps0 
load i6 
&rr[i8][i8] 
é&dn[i8] 
&rxr[i8] [0] 
i1]) [0] 
12) [0] 
oad alpo 
oad i7 

0) [19] 


&rr[i0] [i8 


load dn[i8 
é&dn[i9] 
érr[il] [i8 


&érr[i2] [i8 
13] [0] 


&érrv[i9 
14] [0] 
15] [0] 


load rr[i8] [i8] 


load rr[i0] [i8] 


load rr[il] [i8] 


LDH 
SADD 
SMP Y 


OUTERLOOP : 


-M1X 


*B9++[5],A5 
256,A0 
A9,B2,A9 
Al,A0,Al 


*A9++[5],B12 
B11,B2,B10 
7,A2 

512,B0 
A15,B2,A15 
B8,B12,B4 
B5,B0,B5 


*A15++[5],A5 
AO,1,A0 


B12,16,Bll 
Al,B2,Al 
A5,A0,A8 


*B10++[5],A5 
-1,A3 
A5,A0,A8 


*A1++[5],B12 
B5,B2,Bll 
AO,7,A13 
Al1,A8,A11 
B15,B11,B15 
A5,A0,A8 


*B11++[5],A5 
Al11,A8,A11 
A5,A0,A8 


*A4++[5],A5 
*B4++[5],B12 
Al1,A8,A11 
B13,5,B13 
B12,A0,A8 


*B14++[5],Bl2 
Al11,A8,A11 
A5,A0,A8 


[18] 


outer loop counter 


BO=_1_64 
&rr[i4] [i8 
&rr[i8] [i9 
[i7] [0 


load rr[i4 
_1_ 64 


dn[i8] << 1] 
&rr[i6] [i8 
smpy (rr[i8 


load rr[id 
sq=-1 
smpy (rr[i0 


load rr[ié 
&rxr[i7] [i8 
alp=0x10000 


18],_1_128) 


i8] 


i8],_1_64) 


i8] 


alpl=sadd(alp0, smpy (rr[i8] [i8],_1_128) ) 


psl 
smpy (rr[il] 


load rr[i7] 


i8],_1_64) 


i8] 


alpl=sadd(alpl, smpy (rr[i0] [i8],_1_64) ) 


smpy (rr[i2] 


load rrv[i9 
load rr[i8 


i8],_1_64) 


] 
[19] 


alpl=sadd(alpl,smpy (rr[il] [i8],_1_64) ) 


smpy (rr[i3 


load dn[i9 


[i8],_1_64) 


alpl=sadd(alpl, smpy (rr[i2] [i8],_1_64) ) 


smpy (rr[i4 


[i8],_1_64) 


— 


sernYy 


B10, *+A6[4] 
Al1,A8,A11 
A5,A0,A8 


Al, *+A6[5] 
AO, 6,A10 

Al1,A8,A11 
B12,A0,A8 


*A4++[5],A5 
*B4++[5],B12 
AO, 3,A0 
Al1,A8,A11 
A5,A0,A8 


*B14++[5],B12 


Al1,A8,A11 
A5,A0,A5 
B12,B0,B12 
B11, *+A6[6] 
B12,16,Bl 
Al1,A8,A11 
All, *+A6[3] 
Al1,A5,A5 
B11,B15,B5 


*A4++[5],A5 
*B4++[5],B12 
INNERLOOP 
A5,B12,Al 
B5,B5,B8 


*B14++[5],B12 
4,Al 

0,B2 
Al,A10,A8 
A5,A0,A5 
B12,B0,B12 


’ 


’ 


7** load rrv[i9] 


’ 


’ 


yee Load dn[i9] 


, 


’ 


’ 


, 


;* load dn[i9 


;* smpy (rr[i8] [i9],_1_64) 


store &rr[i5] [i8+5] 
alpl=sadd(alp1, smpy (rr[i3] [i8],_1_64) ) 
smpy (rr[i5] [i8],_1_64) 


store &rr[i6é] [i8+5] 

0x8000L 
alpl=sadd(alp1, smpy (rr[i4] [i8],_1_64) ) 
smpy (rr[i6] [i8],_1_64) 


* load rrv[i9] 
* load rr[i8] [19] 
A0=_1_8 
alpl=sadd(alp1, smpy (rr[i5] [i8],_1_64) ) 
smpy (rr[i7] [i8],_1_64) 


alpl=sadd(alpl, smyp (rr[i6] [i8],_1_64) ) 
smpy (rrv[i9],_1_8) 
smpy (rr[i8] [i9],_1_64) 


store &rr[i7] [i8+5] 
dn[i9] << 16 
done alpl=sadd(alpl,smpy (rr[i7] [i8],_1_64) ) 


store alpl 
alp2=sadd(alpl,smpy (rrv[i9],_1_8) ) 
ps2=sadd (ps1, dn[i9]<<16) 


** load rr[i8] [19] 

branch to the innerloop 
alp2=sadd(alp2,smpy (rr[i8] [i9],_1_64) ) 
sq2=smpyh (ps2,ps2) 


innerloop counter 


alp_16 = sacc(alp2, 0x8000L) 
* smpy (rrv[i9],_1_8) 


INNERLOOP : 


LDW 
STH 
SHL 
ADD 
SMP YH 
SMP YH 


B2] 
[B2] 


[B2] 


-S1X 


~S2X 


*+A6[3],A11 
B13, *+B6[9] 
B12,16,B10 
B13,5,B13 
A8,A3,A11 
B8,A13,B10 


B11, *+A6[1] 
A7,*+B6[8] 
B5,Bl11 
A11,A5,A5 
B10,B15,B5 


*A4++[5],A5 


*B4++[5],B12 


Al,1,Al 
INNERLOOP 
A5,B12,Al1l 
B10,A11,B2 
B5,B5,B8 


A8,A13 


*B14++[5],B12 


B8,A3 
A11,A10,A8 
A5,A0,A5 
B12,B0,B12 


B11, *+A6[1] 
A7,*+B6[8] 
B12,16,B10 
B5,Bl11 
A8,A3,A11 
B8,A13,B10 


*+A6[2],A11 
B13, *+B6[9] 
A6,B2 
A11,A5,A5 
B10,B15,B5 


, 
, 


, 


, 


, 


load alpl 
store ib=i9 


,* dn[i9]<<16 


19=i19+STEP 
smpyh (alp_16,sq) 
smpyh (alp, sq2) 


store ps 
store ia = i8 


; *alp2=sadd(alpl, smpy (rrv[i9],_1_8) ) 
;* ps2=sadd(psl1,dn[i9]<<16) 


7*** load rrv[i9+10] 
PRe®. load rele) [aortol 


, 
, 
, 
, 


, 


, 


decrement innerloop counter 
branch to INNERLOOP 


; *alp2=sadd(alp2, smpy (rr[i8] [i9],_1_64) ) 


if smpyh(alp,sq2) > smpyh(alp_16,sq) 


;* sq2=smpyh (ps2,ps2) 


alp=alp_16 


7*** load dn[i9+10] 


, 
, 


, 


;* alp_16=sadd(alp2, 


sq=sqz2 
0x8000L) 


;*** BO = _1_8 


;*** BO = _1 64 


, 


end of innerloop 


store ps 

store ia = i8 
dn[i9]<<16 

ps2 

smpyh (alp_16, sq) 
smpyh (alp, sq2) 


load alp0d 

store ib=i9 

stack pointer 
alp2=sadd(alp2,smpy (rr[i8] [i9],_1_64) ) 
ps2=sadd (ps1, dn[i9]<<16) 


*4+A6[5],Al 
*B2,B15 
205,A0 
A5,B12,A11l 
B10,Al1,B2 
B5,B5,B8 


*++A12[A0],A5 
*B7++[5],Bl2 
A8,A13 
-90,B14 

B8,A3 


*+A6[4],B10 
*B3++[5],A5 
-90,A4 
A11,A10,A8 
B13,5,B13 


*A14++[5],A5 
B13, *+B6[9] 
A8,A3,A10 
B8,A13,B10 


256,A0 
*+A6[6],B11 
*B9++[5],A5 
OUTERLOOP 


*A9++[5],B12 
A7,*+B6[8] 
B13,5,B13 
B10,A10,B0 


*A15++[5],A5 
B13, *+B6[9] 
AO,1,A0 
=35,;B13 
A8,A13 
A5,A0,A 


, 


_1_128 


&rxr[i6] [i8] 
load ps0 


alp2=sadd(alp2,smpy (rr[i8] [i9],_1_64) ) 
if smpyh(alp,sq2) > smpyh(alp_16,sq) 
sq2=smpyh (ps2,ps2) 


load rr[i8] [i8] 
load dn[i8] 
alp=alp_16 
é&dn[i9] 

sq=sq2 


&rr[i5] [i8 
load rr[i0d 
&rrv[i9] 

alp_16=sadd(alp2, 0x8000L) 


] 
1 [18] 


load rr[il] [i8] 
store ib=i9 
smpyh (alp_16,sq) 
smpyh (alp, sq2) 


&rr[i7] [i8 
load rr[i2][i8] 
branch to OUTERLOOP 


load rr[i3][i8] 
store ia = i8 

update i9 
if smpyh(alp,sq2) > smpyh(alp_16,sq) 


load rr[i4][i8] 
store ib=i9 


_1_64 
update i9 
alp=alp_16 


smpy (rr[i8] [i8],_1_128) 


[BO] 


BO] 


NnNnn PN 
iw) 
iw) 
x 


-D1B11, *+A6[1] 
-D2 
-S1X 
~S2 
Ll 


*B10++[5],A5 
B8,A3 
B12,16,B11 
A2,1,A2 
A5,A0,A8 


*A1++[5],B12 
A7, *+B6[8] 
310,B4 
A11,A8,A11l 
B15,B11,B15 
A5,A0,A8 


B5,*+A6[1] 
*B11++[5],A5 
A7,5,A7 
A11,A8,A11l 
A0,BO 
A5,A0,A8 


decrement OUTERLOOP counter 


store ps 

load rr[i5][i8 
sq=sq2 

dn[i8] << 16 
smpy (rr[i0] [i8 
load rr[i6][i8 
store ia = i8 
&rr[i8] [i9 
alpl=sadd(alp0, 
psl = sadd(ps0, 
smpy (rr[il] [i8 
store ps 

load rr[i7] [i8 
update i8 
alpl=sadd(al 
_1_64 

smpy (rr[i2] [i8 


,—1_64) 


smpy (rr[i8] [i8],_1_128) ) 
dn[i8]<<16) 
,_1_64) 


pl,smpy (rr[i0] [i8],_1_64) ) 


,—1_64) 
A.2.6 Implementation of the FIR Filter, residu.c, in GSM EFR Vocoder


Example A-31 shows the C code for the FIR filter, residu.c, in the GSM EFR
vocoder.


Example A-31. C Code for residu.c

#define Word16 short
#define Word32 int

Original C code

/* m = LPC order == 10 */
#define m 10

void Residu (
    Word16 a[],    /* (i) : prediction coefficients */
    Word16 x[],    /* (i) : speech signal           */
    Word16 y[],    /* (o) : residual signal         */
    Word16 lg      /* (i) : size of filtering       */
)
{
    Word16 i, j;
    Word32 s;

    for (i = 0; i < lg; i++)
    {
        s = L_mult (x[i], a[0]);
        for (j = 1; j <= m; j++)
            s = L_mac (s, a[j], x[i - j]);
        s = L_shl (s, 3);
        y[i] = round (s);
    }
    return;
}

where L_mult(a,b)   = _smpy(a,b)
      L_mac(a,b,c)  = _sadd(a,_smpy(b,c))
      L_shl(a,b)    = (b>0) ? _sshl(a,b) : a >> (-b)
      round(a)      = _sadd(a,0x8000L)>>16
and lg = 40.


A.2.6.1 Rearranging the C Code 


L_shl (s, 3) can be implemented simply as _sshl (s,3). Because array a has 
dimension m + 1 = 11 and the inner loop is always executed 10 times per outer 
loop iteration, you can completely unroll the inner loop to gain speed by 
representing array a with registers. Because a is a short integer array, it 
requires six registers at most for full representation. You can assign one 
register only for a[0] for the following reasons: 


- a[0] is always a constant, 4096
- _shr (0x8000L, 3) = 4096

That is, adding a[0] before the left shift by 3 has the same effect as adding
0x8000L after it. You can therefore change the order of rounding and left shift
to save one register. (Otherwise, you need another register for 0x8000L.) The
C code, after complete inner loop unrolling, is shown in Example A-32.


Example A-32. C Code for residu.c After Rearrangement Using Intrinsics

    for (i = 0; i < lg; i++)
    {
        s = _smpy(x[i], a[0]);
        s = _sadd(s, _smpy(a[1], x[i-1]));
        s = _sadd(s, _smpy(a[2], x[i-2]));
        s = _sadd(s, _smpy(a[3], x[i-3]));
        s = _sadd(s, _smpy(a[4], x[i-4]));
        s = _sadd(s, _smpy(a[5], x[i-5]));
        s = _sadd(s, _smpy(a[6], x[i-6]));
        s = _sadd(s, _smpy(a[7], x[i-7]));
        s = _sadd(s, _smpy(a[8], x[i-8]));
        s = _sadd(s, _smpy(a[9], x[i-9]));
        s = _sadd(s, _smpy(a[10], x[i-10]));
        s = _sadd(s, a[0]);
        s = _sshl(s, 3);
        y[i] = _shr(s, 16);
    }
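
The reordering of the rounding and the left shift can be checked on a host machine. The following is a minimal sketch under stated assumptions: sat32, sadd32, sshl3, and hi16 are hypothetical portable stand-ins for saturation, _sadd, _sshl(s,3), and the final shift right by 16, not TI's implementations. It compares the original order (shift by 3, then add 0x8000L) with the rearranged order (add 4096, then shift by 3).

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate result to the 32-bit range. */
int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Stand-in for _sadd: saturating 32-bit addition. */
int32_t sadd32(int32_t a, int32_t b)
{
    return sat32((int64_t)a + b);
}

/* Stand-in for _sshl(a, 3): saturating left shift by three bits. */
int32_t sshl3(int32_t a)
{
    return sat32((int64_t)a * 8);
}

/* Upper 16 bits of x (floor division, independent of how the host
   compiler shifts negative values). */
int16_t hi16(int32_t x)
{
    int64_t v = x;
    return (int16_t)(v >= 0 ? v / 65536 : -((-v + 65535) / 65536));
}

/* Original order: y = round(L_shl(s, 3)). */
int16_t shift_then_round(int32_t s)
{
    return hi16(sadd32(sshl3(s), 0x8000));
}

/* Rearranged order: add 4096 (the value of a[0]) first, then shift. */
int16_t round_then_shift(int32_t s)
{
    return hi16(sshl3(sadd32(s, 4096)));
}
```

The two functions agree for every 32-bit input, including the inputs that saturate, so the constant 0x8000L does not need a register of its own.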


A.2.6.2 Performance Analysis

The performance is limited by the .L unit for _sadd because this unit is used
at least 11 times per iteration. With two .L units available, it takes at least
six cycles per iteration. You may choose to unroll the loop once to compute two
y values per iteration for the following reasons:


- To satisfy the ordering property of _sadd
- To maximize speed: eleven cycles are required to compute two y values,
  while six cycles are needed for one y


The C code is shown in Example A-33.


Implementation of the GSM EFR Vocoder 


Example A-33. Implemented C Code for residu.c

    for (i = 0; i < lg; i+=2)
    {
        s0 = _smpy(x[i], a[0]);
        s1 = _smpy(x[i+1], a[0]);
        s0 = _sadd(s0,_smpy(a[1], x[i-1]));
        s1 = _sadd(s1,_smpy(a[1], x[i]));
        s0 = _sadd(s0,_smpy(a[2], x[i-2]));
        s1 = _sadd(s1,_smpy(a[2], x[i-1]));
        s0 = _sadd(s0,_smpy(a[3], x[i-3]));
        s1 = _sadd(s1,_smpy(a[3], x[i-2]));
        s0 = _sadd(s0,_smpy(a[4], x[i-4]));
        s1 = _sadd(s1,_smpy(a[4], x[i-3]));
        s0 = _sadd(s0,_smpy(a[5], x[i-5]));
        s1 = _sadd(s1,_smpy(a[5], x[i-4]));
        s0 = _sadd(s0,_smpy(a[6], x[i-6]));
        s1 = _sadd(s1,_smpy(a[6], x[i-5]));
        s0 = _sadd(s0,_smpy(a[7], x[i-7]));
        s1 = _sadd(s1,_smpy(a[7], x[i-6]));
        s0 = _sadd(s0,_smpy(a[8], x[i-8]));
        s1 = _sadd(s1,_smpy(a[8], x[i-7]));
        s0 = _sadd(s0,_smpy(a[9], x[i-9]));
        s1 = _sadd(s1,_smpy(a[9], x[i-8]));
        s0 = _sadd(s0,_smpy(a[10], x[i-10]));
        s1 = _sadd(s1,_smpy(a[10], x[i-9]));
        s0 = _sadd(s0, a[0]);
        s1 = _sadd(s1, a[0]);
        s0 = _sshl(s0,3);
        s1 = _sshl(s1,3);
        y[i] = _shr(s0,16);
        y[i+1] = _shr(s1,16);
    }


A.2.6.3 Final Assembly Code for residu.c 


The final assembly code is shown in Example A-34.

Example A-34. Assembly Code for residu.c


Kk 


Kk 


Kk 


Kk 


Kk 


Kk 


Kk 


K* 


Kk 


Kk 


Implementation 
Compute two ys 
Total cycles = 


= 237 
Register Usage: 


SMP Y 
SMP YHL 
LDW 
SADD 
SADD 


SMP YHL 
SMP Y 
SADD 
SADD 


of residu.c EFR 


at a time 


(1lg/2+1) *11+6 
(for lg 


, 
, 
, 


, 


A3,A0,A8 
A3,B0,B8 
*A4--, Al 
A8,A9,A9 
B8,B9,B9 


Al,B4,A8 
A3,B4,B8 
A8,A9,A9 
B8,B9,B9 


A4 --- 


&a[lO] 
&x [0] 
&y [0] 
19g 
oad a[0] = 4096 
oad x[0] & x[1 
oad all] & 2 
load x[-2] & x[-1] 
oad a[3] & a[4 
oad a[5] & a[6 
oad a[7] & a[8 
oad x[-4] & x[-3] 
load a[9] & a[10] 
to take care of the first execution 
a[0] = 4096 
loop counter, L_SUBFR/2 
smpy (x[0],a[0]) 
smpy (x[1],a[0]) 
load x[-6] x[-5] 


& 
sO = sadd(s0, 


sl = sadd(s1 


smpy (x 
smpy (x[0],a 


sO = sadd(s0, 


sl = sadd(s1 


, smpy (x[-8],al[9])) 


[-1],a[1]) 


1]) 


KKKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KK KKK KKK KKK KK KKK KKK KKK KK KKK KKKKKKKK KKK KK 


Kk 


Kk 


Kk 


Kk 


Kk 


Kk 


Kk 


Kk 


Kk 


Kk 


KKKK KK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KK KKK KKK KK KKKKKKKKKKKKKKKKKKK 




pointer 
[!A2] 
[!A2] 


SMPYLH 
SMP YH 
ADD 
ADD 
LDW 
SADD 
SADD 


SMPYHL 
SMPY 
SADD 


SADD 
SSHL 
SSHL 


SMPYLH 
SMP YH 
SADD 
SADD 
LDW 


SHR 
SHR 


SMPYHL 
SMPY 
SADD 
SADD 
STH 


SMPYLH 
SMP YH 
SADD 
SADD 
LDW 


SMPYHL 
SMPY 
SADD 
SADD 
LDW 


.M1X 
-M2X 
euswle 
S2 


Al,B4,A8 
Al,B4,B8 
A8,0,A9 
B8,0,B9 
*A4--, A3 
A9,A0,A9 
B9,BO,B9 


A3,B1,A8 
Al,B1,B8 
A8,A9,A9 


B8,B9,B9 
A9,3,A7 
B9,3,Bl 


A3,B1,A8 
A3,B1,B8 
A8,A9,A9 
B8,B9,B9 
*A4++[6],Al 


A7,16,A7 


B10,16,B10 


Al1,B5,A8 
A3,B5,B8 


7 smpy (x a 
7 smpy (x a 
7; sO=smpy ( ] 
7; sl=smpy ( i 
; load x[-8] & 

0 

1 


; sO 
7S 


7 smpy (x 
7 smpy (x 


7; sO 


aah 
i380 
psi 


7 smpy(x[-4], 
7 smpy (x 


; sO 
posi 


7 y[0] 
7 yl) 


; to the new &x[0] 


7 smpy(x[-5],al[5]) 
7; smpy(x[-4],a[5]) 
; sO = sadd(s0O, smpy(x[-3],a[3])) 
; sl = sadd(sl, smpy(x[-2],a[3])) 


; store y[0] 
7 decrement loop counter 
> branch to the loop 


; smpy(x[-6],a[6]) 
7; smpy(x[-5],a[6]) 
; sO = sadd(s0O, smpy(x[-4],a[4])) 
; sl = sadd(sl, smpy(x[-3],a[4])) 


7* load x[0] & x[1] for the next iteration 


7 smpy(x[-7], 
7 smpy (x 
; sO = sadd( 


7 Sl 


7* load x[-1l 


[=2 
bal 
x 
x 


[-3],a[3]) 
[-2],a[3]) 
sadd(s0, smpy(x[-1],a[1])) 


sadd(sl, smpy(x[0],a[1])) 
L_shl(s0,3) 
L_shl(s1,3) 


al4]) 

[-3],a[4]) 

sadd(s0, smpy(x[-2],a[2])) 
sadd(sl, smpy(x[-1],a[2])) 
x[-10] & x[-9] and update the 


= shr(s0, 16) 
= shr(sl, 16) 


al7]) 

a[l7]) 

0, smpy(x[-5],a[5])) 
1, smpy(x[-4],a[5])) 
& x[-2] 


[-6] 


s 
sadd(s 
] 


SMP YLH 
SMP YH 
SADD 
SADD 
[!A2] STH 


SMP YHL 
SMP Y 
SADD 
SADD 
[A2] SUB 
LDW 


SMP YLH 
SMP YH 
SADD 
SADD 


1x A3,B6,A8 ; smpy(x[-8],a[8]) 
.M2X A3,B6,B8 ; smpy(x[-7],al[8]) 
etal A8,A9,A9 ; sO = sadd(s0O, smpy(x[-6],a[6])) 
~L2 B8,B9,B9 ; sl = sadd(sl, smpy(x[-5],a[6])) 
2B B10, *A6++ ; store y[1l 

1X Al,B7,A8 ; smpy (x[-9],al[9]) 
-M2X A3,B7,B8 7 smpy(x[-8],a[9]) 
Palak A8,A9,A9 7 sO = sadd(s0O, smpy(x[-7],aI[7])) 
.L2 B8,B9,B9 ; sl = sadd(sl, smpy(x[-6],a[7])) 
22 AD Tesh? 
2Dd *A4—-—-,A3 ;* load x[-3] & x[-4] 

x Al,B7,A8 7 smpy(x[-10],a[10]) 

2X Al,B7,B8 7 smpy (x[-9],a[10]) 
ud A8,A9,A9 ; sO = sadd(s0O, smpy(x[-8],a[8])) 
22 B8,B9,B9 ; sl = sadd(sl, smpy(x[-7],a[8])) 
There is no memory bank hit within the loop. To avoid a memory bank hit within
the prolog of the loop, arrays a and x must be allocated so that a[1] and x[0] are
offset from each other by one word. Some of the instructions in the loop cannot
be executed in the first iteration. Register A2 indicates which instructions these
are.
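
The header of Example A-34 gives the total cycle count as (lg/2+1)*11+6, which is 237 for lg = 40. The short sketch below only evaluates that formula; the function name is hypothetical, and reading the extra iteration and the six additional cycles as loop fill and setup overhead is an assumption, not something the listing states.

```c
/* Total cycle count for the residu.c assembly, using the formula from
   the header of Example A-34. Each pass through the 11-cycle loop
   produces two y values, so the loop body accounts for lg/2 passes;
   the remaining terms are taken as given. */
int residu_total_cycles(int lg)
{
    return (lg / 2 + 1) * 11 + 6;
}
```

For the EFR subframe length of 40 samples this evaluates to 21 * 11 + 6 = 237 cycles, or just under six cycles per output sample.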


A.2.7 Implementation of the Lag Search in the lag_max() Routine


The lag_max ( ) routine performs an open-loop pitch (or lag) search and 
computes the normalized correlation for the selected lag. This section 
illustrates the implementation of the lag search. The lag search C code is 
shown in Example A-35. 


Example A-35. C Code for the Lag Search in lag_max()

#define Word16 short
#define Word32 int
#define MIN_32 0x80000000L
#define PIT_MAX 143
#define L_FRAME 160

input:
    Word16 scal_sig[PIT_MAX+L_FRAME];   (pointed at scal_sig[PIT_MAX] when passed)
    Word16 scal_fac;                    (not used in this part of the code)
    Word16 L_frame, lag_min, lag_max;
local variables:
    Word16 i, j, *p, *p1, p_max;
    Word32 t0, max;
return:
    Word16 p_max;

Original C code

    max = MIN_32;
    for (i = lag_max; i >= lag_min; i--)
    {
        p = scal_sig;
        p1 = &scal_sig[-i];
        t0 = 0;
        for (j = 0; j < L_frame; j++, p++, p1++)
        {
            t0 = L_mac (t0, *p, *p1);
        }
        if (L_sub (t0, max) >= 0)
        {
            max = t0;
            p_max = i;
        }
    }

where L_mac(a,b,c) = _sadd(a,_smpy(b,c))
      L_sub(a,b)   = _ssub(a,b)
      L_frame      = L_FRAME/2 = 80
and the search range (lag_min, lag_max) is (18,35), (36,71), or (72,143).


A.2.7.1 Rearranging The C Code and Unrolling The Loops


This algorithm favors smaller lag candidates, because it performs the
comparison with if (L_sub (t0,max) >= 0) and the search starts from lag_max.
Because there is no single instruction for the >= (or <=) comparison, you can
change the search order to start from lag_min and compare with if (t0 > max);
p_max is initialized to lag_min. The C code is modified as shown in
Example A-36.


Example A-36. C Code for the Lag Search in lag_max() (Comparison Order Changed)

    max = MIN_32;
    p_max = lag_min;
    for (i = lag_min; i <= lag_max; i++)
    {
        p = scal_sig;
        p1 = &scal_sig[-i];
        t0 = 0;
        for (j=0; j<L_frame; j++, p++, p1++)
            t0 = L_mac(t0, *p, *p1);
        if (t0 > max)
        {
            max = t0;
            p_max = i;
        }
    }
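
The change of search order relies on both loops breaking ties in favor of the smaller lag. The sketch below illustrates this on a host machine. It is not the EFR source: sadd32 and smpy16 are hypothetical stand-ins for _sadd and _smpy, the function names are made up for the example, and lag_max is assumed to be included in the search range, as it is in the original descending loop.

```c
#include <stdint.h>

/* Stand-in for _sadd: saturating 32-bit addition. */
int32_t sadd32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Stand-in for _smpy: 16x16 multiply, doubled, saturated to 32 bits. */
int32_t smpy16(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT32_MAX;
    return (int32_t)a * b * 2;
}

/* Saturating correlation of sig[0..n-1] with the signal delayed by lag.
   sig must point far enough into its buffer that sig[-lag] is valid. */
int32_t correlate(const int16_t *sig, int lag, int n)
{
    int32_t t0 = 0;
    for (int j = 0; j < n; j++)
        t0 = sadd32(t0, smpy16(sig[j], sig[j - lag]));
    return t0;
}

/* Original order: from lag_max down to lag_min, comparing with >=. */
int search_descending(const int16_t *sig, int n, int lag_min, int lag_max)
{
    int32_t max = INT32_MIN;
    int p_max = lag_max;
    for (int i = lag_max; i >= lag_min; i--) {
        int32_t t0 = correlate(sig, i, n);
        if (t0 >= max) {            /* same decision as L_sub(t0,max) >= 0 */
            max = t0;
            p_max = i;
        }
    }
    return p_max;
}

/* Changed order: from lag_min up to lag_max, comparing with >. */
int search_ascending(const int16_t *sig, int n, int lag_min, int lag_max)
{
    int32_t max = INT32_MIN;
    int p_max = lag_min;
    for (int i = lag_min; i <= lag_max; i++) {
        int32_t t0 = correlate(sig, i, n);
        if (t0 > max) {
            max = t0;
            p_max = i;
        }
    }
    return p_max;
}
```

Both functions return the smallest lag that attains the maximum correlation, so the ascending version can use the single greater-than comparison without changing the result.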
Next, look at the inner loop, a general MAC loop. Because *p does not always
equal *p1, it does not fall into the special case described in section A.2.1,
Implementation of the Multiply-Accumulate Loop, beginning on page A-4. Therefore,
the performance cannot be improved by simply unrolling the inner loop.


Now consider unrolling the outer loop once. The C code with outer loop 
unrolling is shown in Example A-37. Because the number of lags that needs 
to be searched within each search range is always even, such unrolling does 
not create an additional case to handle. 


Example A-37. C Code for the Lag Search in lag_max() With Outer Loop Unrolling
              (With Intrinsics Substitutes)

    Word32 t1;

    max = MIN_32;
    p_max = lag_min;
    for (i = lag_min; i < lag_max; i+=2)
    {
        p = scal_sig;
        p1 = &scal_sig[-i];
        t0 = 0;
        t1 = 0;
        for (j=0; j<L_frame; j++, p++, p1++)
        {
            t1=_sadd(t1,_smpy(*p,p1[-1]));  /* or t1=_sadd(t1,_smpy(scal_sig[j],scal_sig[-i-1+j])) */
            t0=_sadd(t0,_smpy(*p,*p1));     /* or t0=_sadd(t0,_smpy(scal_sig[j],scal_sig[-i+j]))   */
        }
        if (t0 > max)
        {
            max = t0;
            p_max = i;
        }
        if (t1 > max)
        {
            max = t1;
            p_max = i+1;
        }
    }


The smaller lag is always compared first in the order of the comparisons. 


The instructions required for one iteration of the inner loop are shown in
Example A-38.


Example A-38. Linear Assembly for the Lag Search in lag_max() Inner Loop

INNERLOOP:
         LDH   .D    *p++,sigj          ; load scal_sig[j]
         LDH   .D    *-p1,scalij1       ; load scal_sig[-i-1+j]
         SMPY  .M    sigj,scalij1,tmp1  ; smpy(scal_sig[j],scal_sig[-i-1+j])
         SADD  .L    t1,tmp1,t1         ; t1=sadd(t1,smpy(scal_sig[j],scal_sig[-i-1+j]))
         LDH   .D    *p1++,scalij       ; load scal_sig[-i+j]
         SMPY  .M    sigj,scalij,tmp0   ; smpy(scal_sig[j],scal_sig[-i+j])
         SADD  .L    t0,tmp0,t0         ; t0=sadd(t0,smpy(scal_sig[j],scal_sig[-i+j]))
[icntr]  SUB   .ALU  icntr,1,icntr      ; decrement inner loop counter
[icntr]  B     .S    INNERLOOP          ; branch to inner loop


The .D unit is used the most (three times). With two .D units available, the
inner loop therefore takes two cycles.


Now unroll the inner loop once. The first iteration of t1 and the last iteration of
t0 are performed outside the inner loop. This avoids memory bank hits. The C code
with the inner and outer loops unrolled is shown in Example A-39.


Example A-39. C Code for the Lag Search in lag_max() With Inner and Outer Loops Unrolled

    Word32 t1;

    max = MIN_32;
    p_max = lag_min;
    for (i = lag_min; i < lag_max; i+=2)
    {
        p = scal_sig;
        p1 = &scal_sig[-i];
        t0 = 0;
        t1 = 0;
        t1=_sadd(t1,_smpy(*p,p1[-1]));      /* or t1=_sadd(t1,_smpy(scal_sig[j],scal_sig[-i-1+j]))   */
        for (j=0; j<(L_frame-1); j+=2, p+=2, p1+=2) {
            t0=_sadd(t0,_smpy(*p,*p1));     /* or t0=_sadd(t0,_smpy(scal_sig[j],scal_sig[-i+j]))     */
            t1=_sadd(t1,_smpy(p[1],*p1));   /* or t1=_sadd(t1,_smpy(scal_sig[j+1],scal_sig[-i+j]))   */
            t0=_sadd(t0,_smpy(p[1],p1[1])); /* or t0=_sadd(t0,_smpy(scal_sig[j+1],scal_sig[-i+j+1])) */
            t1=_sadd(t1,_smpy(p[2],p1[1])); /* or t1=_sadd(t1,_smpy(scal_sig[j+2],scal_sig[-i+j+1])) */
        }
        t0=_sadd(t0,_smpy(scal_sig[L_frame-1],scal_sig[-i+L_frame-1]));
        if (t0 > max) {
            max = t0;
            p_max = i;
        }
        if (t1 > max) {
            max = t1;
            p_max = i+1;
        }
    }

Although five values of scal_sig (scal_sig[j], scal_sig[j+1], scal_sig[j+2],
scal_sig[-i+j], and scal_sig[-i+j+1]) are required for each inner loop
iteration, scal_sig[j] does not need to be loaded, because it was loaded in the
previous iteration. This means only four loads are required per iteration.
Example A-40 gives the instructions for the modified inner loop.

Example A-40. Linear Assembly for the Lag Search in lag_max() Inner Loop

         LDH   .D    *p++,sigj              ; load scal_sig[j]
         LDH   .D    *-p1,scalij1           ; load scal_sig[-i-1+j]
         SMPY  .M    sigj,scalij1,t1        ; t1=smpy(scal_sig[j],scal_sig[-i-1+j])
INNERLOOP:
         LDH   .D    *p1++,scalij           ; load scal_sig[-i+j]
         SMPY  .M    sigj,scalij,tmp0       ; smpy(scal_sig[j],scal_sig[-i+j])
         SADD  .L    t0,tmp0,t0             ; t0=sadd(t0,smpy(scal_sig[j],scal_sig[-i+j]))
         LDH   .D    *p++,sigj+1            ; load scal_sig[j+1]
         SMPY  .M    sigj+1,scalij,tmp1     ; smpy(scal_sig[j+1],scal_sig[-i+j])
         SADD  .L    t1,tmp1,t1             ; t1=sadd(t1,smpy(scal_sig[j+1],scal_sig[-i+j]))
         LDH   .D    *p1++,scalij+1         ; load scal_sig[-i+j+1]
         SMPY  .M    sigj+1,scalij+1,tmp0   ; smpy(scal_sig[j+1],scal_sig[-i+j+1])
         SADD  .L    t0,tmp0,t0             ; t0=sadd(t0,smpy(scal_sig[j+1],scal_sig[-i+j+1]))
         LDH   .D    *p++,sigj+2            ; load scal_sig[j+2], the scal_sig[j] for the
                                            ; next iteration
         SMPY  .M    sigj+2,scalij+1,tmp1   ; smpy(scal_sig[j+2],scal_sig[-i+j+1])
         SADD  .L    t1,tmp1,t1             ; t1=sadd(t1,smpy(scal_sig[j+2],scal_sig[-i+j+1]))
[icntr]  SUB   .ALU  icntr,2,icntr          ; decrement inner loop counter
[icntr]  B     .S    INNERLOOP              ; branch to inner loop


The inner loop uses two cycles. You double the performance, therefore, by 
unrolling both the outer loop and inner loop if no memory bank hits occur. 
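
The idea of moving the first t1 term and the last t0 term outside the loop, so that each loaded value serves both correlations, can be sketched in portable C. The following is an illustration of the peeling idea rather than the manual's exact code: it omits the additional factor-of-two unrolling, sadd32 and smpy16 are hypothetical stand-ins for _sadd and _smpy, and the function names are made up for the example. Because each accumulator still receives its terms in the original order, the saturating results are identical to those of the straightforward loops.

```c
#include <stdint.h>

/* Stand-in for _sadd: saturating 32-bit addition. */
int32_t sadd32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Stand-in for _smpy: 16x16 multiply, doubled, saturated to 32 bits. */
int32_t smpy16(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT32_MAX;
    return (int32_t)a * b * 2;
}

/* Reference: t0 for lag i and t1 for lag i+1, three loads per pass. */
void corr_pair_reference(const int16_t *sig, int i, int n,
                         int32_t *t0, int32_t *t1)
{
    int32_t a = 0, b = 0;
    for (int j = 0; j < n; j++) {
        a = sadd32(a, smpy16(sig[j], sig[j - i]));
        b = sadd32(b, smpy16(sig[j], sig[j - i - 1]));
    }
    *t0 = a;
    *t1 = b;
}

/* Peeled: the first t1 term is computed before the loop and the last t0
   term after it. Each pass loads only sig[j - i] and sig[j + 1]; the
   delayed sample is shared by both sums and the undelayed sample is
   carried over to the next pass. Requires n >= 1. */
void corr_pair_peeled(const int16_t *sig, int i, int n,
                      int32_t *t0, int32_t *t1)
{
    int16_t cur = sig[0];
    int32_t a = 0;
    int32_t b = smpy16(cur, sig[-i - 1]);          /* first t1 term      */
    for (int j = 0; j < n - 1; j++) {
        int16_t lagged = sig[j - i];               /* shared load        */
        int16_t next = sig[j + 1];                 /* reused next pass   */
        a = sadd32(a, smpy16(cur, lagged));        /* t0 term j          */
        b = sadd32(b, smpy16(next, lagged));       /* t1 term j + 1      */
        cur = next;
    }
    a = sadd32(a, smpy16(cur, sig[n - 1 - i]));    /* last t0 term       */
    *t0 = a;
    *t1 = b;
}
```

The peeled version needs two loads per pair of multiply-accumulates instead of three, which is the saving that the unrolled assembly loop exploits.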


A.2.7.2 Avoiding Memory Bank Hits


Load scal_sig[-i+j] and scal_sig[j+1] together and scal_sig[-i+j+1] and
scal_sig[j+2] together to avoid memory bank hits. Memory bank hits can also
be avoided by loading scal_sig[-i+j] and scal_sig[-i+j+1] together and
scal_sig[j+1] and scal_sig[j+2] together.


A.2.7.3 Final Assembly Code for Lag Search


The final assembly code for the lag search segment is shown in
Example A-41.

Example A-41. Assembly Code for the Lag Search in lag_max()


***************************************************************************
**                                                                       **
**  Implementation of the lag search in lag_max(), EFR                   **
**                                                                       **
**  Compare two lags at a time                                           **
**                                                                       **
**  Total cycles = 7+(L_frame+6)*(lag_max-lag_min+1)/2                   **
**                                                                       **
**  Register Usage:     A     B                                          **
**                     10     9                                          **
**                                                                       **
***************************************************************************
; A4 --- &scal_sig
; A6 --- lag_min
; B6 --- lag_max
        SUBAH   .D1     A4,A6,A7        ; p1=&scal_sig[-LAG_MIN]
        MVK     .S2     1,B2
        SUB     .L1X    B6,A6,A1        ; the outer loop counter
        MV      .L2X    A4,B7           ; p=&scal_sig[0]
        MPY     .M2     B0,0,B0         ; initialize the comparison result
        MPY     .M1     A2,0,A2         ; take care of the initial iteration
        MV      .S1     A6,A4           ; p_max = lag_min
        SHL     .S2     B2,31,B2        ; max=MIN_32=0x80000000L
        LDH     .D1     *-A7[1],A5      ; scal_sig[-LAG_MIN-1]
        LDH     .D2     *B7,B5          ; scal_sig[0]
        ADD     .L1     A1,1,A1         ; make the counter an even number
OUTERLOOP:
        LDH     .D1     *A7,A5          ; scal_sig[-LAG_MIN]
        LDH     .D2     *+B7[1],B6      ; scal_sig[1]
 [A2]   SADD    .L2     B10,B8,B10
 [A1]   MVK     .S2     37,B1           ; inner loop counter
        MPY     .M1     A3,0,A3
        MPY     .M2     B8,0,B8
        ADD     .S1     A7,2,A9         ; &scal_sig[-LAG_MIN+1]
        SUB     .L1     A7,4,A7         ; update p1 = &scal_sig[-LAG_MIN-2]
        LDH     .D1     *A9++,A5        ; scal_sig[-LAG_MIN+1]
        LDH     .D2     *+B7[2],B5      ; scal_sig[2]
 [B1]   B       .S1     INNERLOOP       ; branch to the inner loop
 [A2]   CMPGT   .L2     B10,B2,B0       ; if (t0>max)
        LDH     .D1     *A9++,A5        ; scal_sig[-LAG_MIN+2]
        LDH     .D2     *+B7[3],B6      ; scal_sig[3]
 [B0]   MV      .L2     B10,B2          ; max = t0
        MPY     .M1X    B1,1,A2         ; counter to branch to the outer loop
        LDH     .D1     *A9++,A5        ; scal_sig[-LAG_MIN+3]
        LDH     .D2     *+B7[4],B5      ; scal_sig[4]
 [B1]   B       .S2     INNERLOOP       ; branch to the inner loop
 [A2]   CMPGT   .L2X    A0,B2,B0        ; if (t1>max)
 [B0]   SUB     .L1     A6,2,A4         ; p_max = i
        ADD     .S1     A6,2,A6         ; update i
        MPY     .M1     A0,0,A0         ; initialize t1=0
        MPY     .M2     B10,0,B10       ; initialize t0=0
        LDH     .D1     *A9++,A5        ; scal_sig[-LAG_MIN+4]
        LDH     .D2     *+B7[5],B6      ; scal_sig[5]
        SMPY    .M1X    A5,B5,A3        ; _smpy(scal_sig[-LAG_MIN-1], scal_sig[0])
 [B0]   MV      .L2X    A0,B2           ; max = t1
 [B0]   SUB     .L1     A6,3,A4         ; p_max = i+1
 [A1]   SUB     .S1     A1,2,A1         ; update outer loop counter
        ADD     .S2     B7,12,B9        ; &scal_sig[6]
INNERLOOP:
        LDH     .D1     *A9++,A5        ; scal_sig[-LAG_MIN+5]
        LDH     .D2     *B9++,B5        ; scal_sig[6]
        SMPY    .M1X    A5,B6,A3        ; _smpy(scal_sig[-LAG_MIN], scal_sig[1])
        SMPY    .M2X    A5,B5,B8        ; _smpy(scal_sig[-LAG_MIN], scal_sig[0])
        SADD    .L1     A0,A3,A0        ; update t1
        SADD    .L2     B10,B8,B10      ; update t0
 [B1]   B       .S1     INNERLOOP       ; branch to inner loop
 [B1]   SUB     .S2     B1,1,B1         ; decrement inner loop counter
        LDH     .D1     *A9++,A5        ; scal_sig[-LAG_MIN+6]
        LDH     .D2     *B9++,B6        ; scal_sig[7]
        SMPY    .M1X    A5,B5,A3        ; _smpy(scal_sig[-LAG_MIN+1], scal_sig[2])
        SMPY    .M2X    A5,B6,B8        ; _smpy(scal_sig[-LAG_MIN+1], scal_sig[1])
        SADD    .L1     A0,A3,A0        ; update t1
        SADD    .L2     B10,B8,B10      ; update t0
        SUB     .S1     A2,1,A2         ; decrement the counter to branch to the outer loop
 [!A2]  B       .S2     OUTERLOOP       ; branch to the outer loop
        LDH     .D1     *-A7[1],A5      ; scal_sig[-LAG_MIN-3]
        LDH     .D2     *B7,B5          ; scal_sig[0]
        SADD    .L1     A0,A3,A0        ; update t1
        SADD    .L2     B10,B8,B10      ; update t0
 [!A1]  B       .S1     FINISH          ; lag search is complete
FINISH:
        NOP     3
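The cycle count quoted in the header of Example A-41 can be evaluated with a small helper. This is only a sketch: the function name is hypothetical, and the example values passed to it (80 samples, lags 18 through 35) are illustrative inputs rather than figures taken from this section. One plausible reading of the formula is 7 cycles of setup, then L_frame cycles of inner loop plus 6 cycles of outer-loop overhead for each pair of lags.

```c
#include <assert.h>

/* Total cycles = 7 + (L_frame + 6) * (lag_max - lag_min + 1) / 2,
   as given in the header of Example A-41.
   Assumes an even number of lags, since two lags are compared at a time. */
static long lag_search_cycles(int L_frame, int lag_min, int lag_max)
{
    long lags = (long)lag_max - (long)lag_min + 1;
    assert(lags > 0 && lags % 2 == 0);
    return 7 + ((long)L_frame + 6) * lags / 2;
}
```

For instance, lag_search_cycles(80, 18, 35) covers 18 lags and evaluates to 7 + 86 * 9 = 781 cycles.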


All the epilogs and prologs of the outer and inner loops are compressed to
minimize the code size. A2 serves two purposes: it is the indicator that
suppresses the comparisons during the initial iteration of the outer loop,
and it is the counter that triggers the branch to the outer loop during
inner loop execution.
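The two load pairings described in Section A.2.7.2 can be checked with a simplified bank model. The sketch below assumes four interleaved banks that are each one halfword (16 bits) wide, so the bank of an element is its halfword index modulo 4; NUM_BANKS and all function names are assumptions made for illustration, and the model ignores any further partitioning of the memory into blocks.

```c
#include <assert.h>

#define NUM_BANKS 4   /* assumption: four interleaved 16-bit banks */

/* Bank of a halfword index; handles negative indices such as -i+j. */
static int bank_of(int idx)
{
    int b = idx % NUM_BANKS;
    return (b < 0) ? b + NUM_BANKS : b;
}

/* Two loads issued in the same cycle collide when they map to one bank. */
static int bank_hit(int idx_a, int idx_b)
{
    return bank_of(idx_a) == bank_of(idx_b);
}

/* First pairing: scal_sig[-i+j] with scal_sig[j+1],
   and scal_sig[-i+j+1] with scal_sig[j+2]. */
static int hits_cross_pairing(int i, int L_frame)
{
    int hits = 0;
    assert(L_frame % 2 == 0);
    for (int j = 0; j < L_frame; j += 2) {
        hits += bank_hit(-i + j, j + 1);
        hits += bank_hit(-i + j + 1, j + 2);
    }
    return hits;
}

/* Second pairing: scal_sig[-i+j] with scal_sig[-i+j+1],
   and scal_sig[j+1] with scal_sig[j+2]. */
static int hits_adjacent_pairing(int i, int L_frame)
{
    int hits = 0;
    assert(L_frame % 2 == 0);
    for (int j = 0; j < L_frame; j += 2) {
        hits += bank_hit(-i + j, -i + j + 1);
        hits += bank_hit(j + 1, j + 2);
    }
    return hits;
}
```

Under this model the second pairing never collides, because adjacent halfwords always fall in different banks. The first pairing collides only when i+1 is a multiple of NUM_BANKS, which cannot happen when i is even.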
word accesses in 3-13


.endproc directive 2-24

energy computation in MAC loop A-6 to A-8 
enhanced full rate (EFR) A-3 

epilog 3-16 

execute packet 2-10, 2-14, 5-22 

execution cycles, reducing number of 5-4 


extraneous instructions, removing 5-24 
SUB instruction 5-29 


File menu (debugger) 2-7 
FIR filter 
C code 3-14, 3-15, 5-83, 5-109, 5-112
with redundant load elimination 5-84
final assembly 5-120
final assembly for inner loop 5-93
final assembly with redundant load elimination 5-89, 5-102, 5-106
linear assembly 
for outer loop 5-111 
with inner loop unrolled 5-110 
with outer loop conditionally executed with inner loop 5-114, 5-116
linear assembly for inner loop 5-85 
linear assembly for unrolled inner loop 5-96, 
5-98 
software pipelining the outer loop 5-104 
using word access in 3-14
with inner loop unrolled 5-95, 5-104 


flow diagram 
autocorr.c A-9, A-12, A-13 
code development 1-3 


functional units 
description 4-6 
in assembly code 4-7 
reassigning for parallel execution 5-7 


functions 
clock() 3-2 
printf() 3-2 


-g option 2-5
global constants/symbols defined in EFR A-3 


global systems for mobile communications 
(GSM) A-3 


if-then-else 
branching versus conditional instructions 5-59 
C code 5-59, 5-67 
final assembly 5-64, 5-65, 5-72
linear assembly 5-60, 5-63, 5-68, 5-71 


IIR filter, C code 5-50 

iir1.asm, inner loop kernel 2-15 

iir1.c example code 2-4 

index search in search_10i40 A-38 
information elements in tutorial 2-2 
instructions, placement in assembly code 4-4 
int data type 3-2


intrinsics 
_add2() 3-12 
_mpy() 3-13 
_mpyh() 3-13 
_mpyhl() 3-12 
_mpylh() 3-12 


_nassert 3-18 

described 2-17, 3-9 

in residu.c A-51 to A-53 

in saturated add 3-9 
summary table 3-10 to 3-12 


iteration interval, defined 5-18 
-k compiler option 2-5, 3-4
kernel 

loop 2-13, 3-7, 3-16 

of iir1.asm code 2-15

of iir2.asm code 2-22 

of iir8.asm code 2-28 

of mac1.asm code 2-13

of vec_mpy1.asm code 2-14 

of vec_mpy2.asm code 2-21 


-l linker option 2-6
labels in assembly code 4-2 
lag search in lag_max() A-56 
linear, optimizing (phase 3 of flow), in flow diagram 1-3
linear assembly 2-24 
linker command file 2-5 
little endian mode, runtime support 
(rts6201.lib) 2-6 
little-endian mode, and MPY operation 5-11 
live-too-long issues, and software pipelining 3-23 
live-too-long code 5-40 
C code 5-74 
inserting move (MV) instructions 5-78 
unrolling the loop 5-78
live-too-long issues 5-74 
created by split-join paths 5-77 
Load Program File dialog box (debugger) 2-7 
load word (LDW) instruction 5-10 
load6x 2-11, 2-12 
long data type 3-2
loop carry path, described 5-50 
loop control variable, conditionally incremented 3-23
loop counter, handling odd-numbered 3-13 
loop iterations 3-17 
loop kernel 2-13 
loop unrolling 
as major programming method A-2 
dot product 5-10 
for simple loop structure 3-21 
for windowing and scaling in autocorr.c A-9 
if-then-else code 5-67 


loop unrolling (continued) 
in cor_h A-22
in FIR filter 5-95, 5-98, 5-104, 5-109, 5-111 
in lag_max A-58 
in live-too-long solution 5-78 
in vector sum 3-19


mac1.asm kernel, inner loop 2-13
mac1.c example code 2-3
memory bank hits 
avoiding A-2 
in cor_h A-23
in windowing and scaling in autocorr.c A-15 
memory bank scheme, interleaved 5-91 to 5-93 
memory dependency. See dependency 
-mg compiler option 2-5
minimum iteration interval, determining 5-19 
for FIR code 5-87, 5-101, 5-119 
for if-then-else code 5-62, 5-70 
for IIR code 5-53 
for live-too-long code 5-77 
for weighted vector sum 5-32, 5-33 
mnemonic (instruction) 4-4 
modulo iteration interval table 
dot product 
after software pipelining 5-20 
before software pipelining 5-18 
IIR filter 4-cycle loop 5-56 
weighted vector sum 
2-cycle loop 5-37, 5-42, 5-45 
with SHR instructions 5-39 
modulo-scheduling technique, multicycle 
loops 5-31 
_mpy() intrinsic 3-13
_mpy intrinsic, tutorial 2-17
_mpyh() intrinsic 3-13
_mpyhl() intrinsic 3-12
_mpylh() intrinsic 3-12
_mpylh intrinsic, tutorial 2-17
multiply accumulate function 
inner loop kernel of original assembly 
code 2-13 
original C code 2-3 
multiply-accumulate loop (MAC), implementation in 
vocoder application A-4 
_nassert intrinsic 3-11, 3-12, 3-18
node 5-5 


-o compiler option 2-5, 3-4, 3-16, 3-18, 3-22
-o linker option 2-6
operands 
placement in assembly code 4-8 
types of 4-8 
optimizing assembly code, introduction 5-2 
optional tasks in tutorial 2-2 
outer loop conditionally executed with inner 
loop 5-109 
OUTLOOP 5-88, 5-101


parallel bars, in assembly code 4-2 

parent instruction 5-5 

parent node 5-5 

path in dependency graph 5-5 

performance analysis 
index search in search_10i40 A-40 
of C code 3-2
of dot product examples 5-9, 5-15, 5-30 
of FIR filter code 5-101, 5-108, 5-122 
of if-then-else code 5-66, 5-73 
residu.c A-52 

pipeline in ’C62xx 1-2 

-pm compiler option 3-4, 3-5, 3-8, 3-18

pointer operands 4-8 

preparation for tutorial 2-1 

primary tasks in tutorial 2-2 

priming the loop, described 5-27 

priming the pipeline 3-17 

printf() function 3-2

processor mnemonics 4-5 

Profile Marking dialog box 2-8 

Profile menu (debugger) 2-7 

Profile Run dialog box 2-9 

profiling 2-7 to 2-12 

program-level optimization 3-5 

programming methods, summary of A-2 

prolog 3-16, 5-27 
redundant load elimination 5-83
redundant loops 3-18
.reg directive 2-24, 5-11
register allocation 5-100 
register operands 4-8 
registers, partitioning A-41 
related documentation iv
residu.c A-51 
residu.c (FIR filter in EFR) A-51 
resource conflicts 
described 5-38 
live-too-long issues 5-40, 5-74 
resource table 
FIR filter code 5-87, 5-101, 5-119 
if-then-else code 5-62, 5-70 
IIR filter code 5-53 
live-too-long code 5-77 


routines 
autocorr.c A-7 
cor_h A-20 


lag_max() A-56 
rrv computation in search_10i40 A-27 
rts6201.lib file 2-6 
rts6201e.lib file 2-6 
RUNB debugger command 3-3 


.sa extension 2-24
_sadd intrinsic 3-9, 3-11 
scheduling table. See modulo iteration interval table 
shell program (cl6x) 2-5, 3-4 
short arrays 3-12 
short data type 3-2, 3-12
software pipelining 3-16, 3-20 
described 5-16 
when not used 3-22 
software-pipelined schedule 5-20 
source operands 4-8 
split-join path 5-74, 5-75, 5-77 
stand-alone simulator (load6x) 2-11, 3-2 
SunOS shell initialization 2-7 
symbolic names, for data and pointers 5-11 
techniques
for priming the loop 5-27 
for refining C code 3-9 
for removing extra instructions 5-24, 5-29 
using intrinsics 3-9 
word access for short data 3-12 


TMS320C62xx pipeline 1-2 


translating C code to ’C62xx instructions 
dot product, unrolled 5-11 
IIR filter 5-51 
with reduced loop carry path 5-55 
weighted vector sum 5-31 
unrolled inner loop 5-33 


translating C code to linear assembly, dot product 5-4


trip count 2-24, 3-17 
communicating information to the compiler 3-18 
determining the minimum 3-17 


trip counter 
converting to a downcounting loop 3-23 
defined 3-17 


.trip directive 2-24 


vec_mpy1.asm kernel, inner loop 2-14 
vec_mpy1.c example code 2-4 


vector multiply function 

C with word instructions and intrinsics 2-17 

inner loop kernel of assembly from C with intrinsics 2-21

inner loop kernel of original assembly 
code 2-14 

original C code 2-4

tutorial C code example (vec_mpy1.c) 2-4 
vector sum function
See also weighted vector sum
C code with const keyword 3-7
C code with const keyword, _nassert 3-18
C code with const keyword, _nassert, and word access 3-12, 3-13
C code with const keyword, _nassert, word access, and loop unrolling 3-20
C code with three memory operations 3-19
C code word-aligned 3-19
compiler output (original assembly code) 3-8
dependency graph 3-6, 3-7
handling odd-numbered loop counter with 3-13
handling short-aligned data with 3-13
rebasic C code 3-5
rewriting to use word accesses 3-12


VelociTI 1-2
very long instruction word (VLIW) 1-2
VLIW 1-2
vocoder application A-1
vocoder, implementing A-3


weighted vector sum
basic C code 5-31
C code, unrolled version 5-32
C code with loop unrolling 5-32
final assembly 5-48
linear assembly for inner loop 5-31
linear assembly with resources allocated 5-35, 5-46
translating C code to assembly instructions 5-33
windowing and scaling in autocorr.c A-7
word access
in dot product 3-13 to 3-14
in FIR filter 3-14
using for short data 3-12 to 3-15


-z compiler option 2-5